A Galician-Portuguese Generative Model
Large language models (LLMs) have revolutionized natural language processing, but their predominant focus on English has resulted in biases and performance disparities across languages. This situation persists in multilingual generative models, where English remains the dominant language. In these models, the presence of European Portuguese is marginal and that of the Galician variety is almost residual. In this work, we describe an open-source Galician-Portuguese generative model, Carvalho_pt-gl, focused precisely on these two language varieties, which are very close lexically and syntactically. The model was trained using a GPT architecture with 1.3 billion parameters on more than 6B words, balanced between the two varieties. A continual pretraining strategy was used to adapt a pre-existing LLM trained on a trilingual dataset of related languages, thereby overcoming the data limitations that would arise if training started from scratch. Evaluation results on task-based datasets from standardized benchmarks indicate promising performance. These findings highlight the critical importance of supporting linguistic diversity in generative models.
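The continual pretraining strategy described above amounts to loading an existing trilingual checkpoint and continuing causal language modeling on the new Galician-Portuguese corpus rather than training from scratch. The following is a minimal sketch of that setup using the Hugging Face Transformers library; the checkpoint name, corpus paths, and hyperparameters are illustrative placeholders, not details taken from the paper.

    # Minimal continual-pretraining sketch (assumed setup, not the authors' exact pipeline).
    from datasets import load_dataset
    from transformers import (
        AutoModelForCausalLM,
        AutoTokenizer,
        DataCollatorForLanguageModeling,
        Trainer,
        TrainingArguments,
    )

    base_checkpoint = "some-org/trilingual-1.3b"  # hypothetical pre-existing base model
    tokenizer = AutoTokenizer.from_pretrained(base_checkpoint)
    if tokenizer.pad_token is None:
        tokenizer.pad_token = tokenizer.eos_token  # GPT-style tokenizers often lack a pad token
    model = AutoModelForCausalLM.from_pretrained(base_checkpoint)

    # Hypothetical plain-text corpus, balanced between Galician and European Portuguese.
    raw = load_dataset("text", data_files={"train": ["corpus/gl/*.txt", "corpus/pt/*.txt"]})

    def tokenize(batch):
        return tokenizer(batch["text"], truncation=True, max_length=2048)

    tokenized = raw["train"].map(tokenize, batched=True, remove_columns=["text"])

    args = TrainingArguments(
        output_dir="carvalho-continual",
        per_device_train_batch_size=4,
        gradient_accumulation_steps=32,
        learning_rate=2e-5,
        num_train_epochs=1,
        bf16=True,
        save_steps=5000,
        logging_steps=100,
    )

    trainer = Trainer(
        model=model,
        args=args,
        train_dataset=tokenized,
        # mlm=False gives standard next-token (causal LM) training
        data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
    )
    trainer.train()

The key point is that the optimizer starts from weights that already encode closely related languages, so the balanced Galician-Portuguese data only needs to shift the model toward the two target varieties.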
keywords: Continual Pretraining, Galician, Generative Models, Large Language Models, Portuguese
Publication: Congress
January 15, 2025
/research/publications/a-galician-portuguese-generative-model
Authors: Gamallo P., Rodriguez P., Santos D., Sotelo S., Miquelina N., Paniagua S., Schmidt D., de-Dios-Flores I., Quaresma P., Bardanca D., Pichel J.R., Nogueira V., Barro S.
DOI: 10.1007/978-3-031-73503-5_24