CorpusNÓS: A massive Galician corpus for training large language models
We present a systematic analysis of the influence of vocabulary size on the performance of Neural Machine Translation (NMT) models, with a particular focus on Galician language models (Basque-Galician, Catalan-Galician, and English-Galician). The study encompasses an exploration of varying vocabulary sizes employing the Byte Pair Encoding (BPE) subword segmentation methodology, with a particular emphasis on BLEU scores. Our results reveal a consistent preference for smaller BPE models, a preference that persists across different scales of training data. The study underscores the importance of vocabulary size in NMT, providing insights for languages with varying data volumes.
Keywords: Corpus, Galician
Publication: Congress
October 15, 2024
/research/publications/corpusnos-a-massive-galician-corpus-for-training-large-language-models
Authors: Daniel Bardanca Outeirinho, Pablo Gamallo, Iria de-Dios-Flores, José Ramom Pichel Campos
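As an illustration of the methodology described in the abstract, the sketch below trains BPE subword models at several vocabulary sizes and compares how each segments the same Galician sentence. The paper does not specify its tooling; the sentencepiece library, the corpus file name, and the vocabulary sizes shown here are assumptions for illustration only.

```python
# Minimal sketch: train BPE models at several vocabulary sizes and
# inspect their segmentations. The toolkit (sentencepiece), the corpus
# file name, and the size sweep are illustrative assumptions, not the
# paper's actual setup.
import sentencepiece as spm

VOCAB_SIZES = [4000, 8000, 16000, 32000]  # hypothetical range to sweep

for size in VOCAB_SIZES:
    # Train one BPE model per vocabulary size on the same corpus.
    spm.SentencePieceTrainer.train(
        input="galician_corpus.txt",   # hypothetical training corpus
        model_prefix=f"bpe_{size}",
        vocab_size=size,
        model_type="bpe",
    )

# Compare how each model segments the same sentence: smaller
# vocabularies produce more, shorter subword units.
sentence = "O galego é unha lingua romance falada en Galicia."
for size in VOCAB_SIZES:
    sp = spm.SentencePieceProcessor(model_file=f"bpe_{size}.model")
    print(size, sp.encode(sentence, out_type=str))
```

In an experiment like the one the abstract describes, each resulting BPE model would be used to preprocess the parallel training data for a separate NMT run, with BLEU computed on a held-out test set to compare vocabulary sizes.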