‘Carballo’, the first large language model ever built for Galician, is born at USC

CiTIUS and ILG (Instituto da Lingua Galega) present the first large AI language model for Galician: a historic step within the ‘Nós Project’ that enables the development of technological tools and intelligent systems created specifically for Galicia’s own language.

The Nós Project, developed by CiTIUS (Centre for Research in Intelligent Technologies) and ILG (Institute of the Galician Language), has announced the creation of Carballo: a high-quality large-scale language model in Galician, enabling the development of new generative AI tools and applications for Galicia’s own language.

Carballo is a large-scale language model, the largest ever created for Galician. It is what is known as a foundational model: the basic, and essential, building block for versatile, high-quality generative AI language tools such as chatbots, translators, or spelling and grammar checkers.

As with other foundational models, Carballo still requires small technical adaptations before it can act as a dialogue system capable of holding a fluent conversation and responding automatically through simple, intuitive interaction.

To draw an analogy with the best-known generative AI system currently in use worldwide (ChatGPT, owned by OpenAI), the resulting tool (Chat) would not exist without the foundational model supporting it (GPT). Foundational models have not been adapted or fine-tuned with instructions aimed at solving specific tasks, and are therefore not designed for direct public use. Even so, they are an essential step towards the disruptive language AI applications we already know today.

In the depths of Carballo

Carballo is the result of two research projects: Nós, driven by the Galician Government, and ILENIA, promoted by the Ministry for Digital Transformation and Public Function to boost all official languages of the State. The Galician model is based on Flor1.3, the equivalent model previously built for Catalan at the Barcelona Supercomputing Centre (BSC-CNS) within the framework of the AINA-ILENIA project.

Carballo has a GPT architecture with 1.3 billion ‘parameters’. In other words: 1.3 billion values adjusted through a training process on text corpora, so that the model develops high competence in using Galician. Its training was a significant computational challenge, requiring the collaboration of CESGA (Galician Supercomputing Centre), which has the second most powerful supercomputer in Spain.
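As a rough illustration of where a figure such as 1.3 billion parameters can come from, the sketch below applies the standard back-of-the-envelope estimate for a GPT-style decoder. The function name and the configuration used (24 layers, a hidden size of 2048, a vocabulary of 50,257 tokens) are assumptions chosen for illustration only; they are not Carballo's published configuration. The estimate also ignores small terms such as biases and layer normalisation.

```python
def gpt_parameter_count(n_layers: int, d_model: int, vocab_size: int) -> int:
    """Approximate parameter count of a GPT-style decoder (weight matrices only)."""
    # Self-attention: query, key, value and output projections, each d_model x d_model.
    attention = 4 * d_model * d_model
    # Feed-forward block: two linear layers with a hidden size of 4 * d_model.
    feed_forward = 8 * d_model * d_model
    per_layer = attention + feed_forward
    # Token embedding table, assumed to be shared with the output layer.
    embeddings = vocab_size * d_model
    return n_layers * per_layer + embeddings


# Hypothetical configuration, NOT taken from the article.
total = gpt_parameter_count(n_layers=24, d_model=2048, vocab_size=50_257)
print(f"{total:,} parameters (~{total / 1e9:.2f} billion)")
# prints "1,310,885,888 parameters (~1.31 billion)"
```

Under these assumptions, almost all of the parameters sit in the repeated attention and feed-forward layers, and only around 8% belong to the vocabulary embeddings.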

Carballo was trained on a massive corpus of Galician texts called CorpusNós, comprising approximately 2.1 billion words: the largest textual corpus in Galician to date. A significant part of this corpus was built within the Nós Project itself, under numerous agreements and data transfer arrangements with companies and organizations providing textual data. Contributors of 'raw' data to this cooperative effort include media such as NósDiario, PrazaPública, and CRTVG; the publishers Galaxia and Laiovento; and public institutions such as the Parliament of Galicia, the Council of Galician Culture, the provincial governments of A Coruña and Lugo, and the Royal Galician Academy, among many others.

Free and open resources

The ILENIA project, promoted by the Ministry for Digital Transformation and Public Function, aims to generate digital resources enabling the development of multilingual applications in the various official languages of the State. Alongside USC (Nós, Galician) and BSC-CNS (AINA, Catalan), the project also involves the centres CENID (VIVES project, Valencian) and HiTZ (NEL-GAITU project, Basque).

The foundational model Carballo for Galician is a further step in this strategy of building scientific and technological capabilities that are independent of large corporations not aligned with the social and cultural reality of Galicia. The strategy creates open and free resources so that other companies and institutions can develop language technologies in Galician of broad social, and even economic, interest. The goal is to help create a dynamic business environment that grows with the latest advances in artificial intelligence and revolves around the Galician language, while also strengthening ties with the Lusophone sphere and thus with the Portuguese language market of close to 300 million speakers. Alongside Carballo, the first foundational model for Galician and Portuguese, Carvalho, was also developed in collaboration with the University of Évora to strengthen our language through the inclusion of European Portuguese.

Carballo was publicly released, with the aim that both experts and software companies can use the model to develop new products, make adjustments, or even integrate its use into applications useful for the general public.

CiTIUS emphasises that Carballo was developed "in line with the guiding principles of ‘Trusted AI’", a paradigm of ‘responsible’ Artificial Intelligence aligned with the trustworthy AI principles set out in the European AI Regulation, the world's first artificial intelligence law, recently passed by the European Parliament. This initiative was also co-financed by the European Union through the Galicia ERDF 2021-2027 Program.

Within the Nós Project, the team responsible for Carballo continues to work on improving the model's quality, as well as on building larger foundational models and adapting them to handle multiple tasks, as widely used commercial tools such as ChatGPT already do. For now, a demonstrator is already available, allowing basic use of the model along with some pre-built examples.