CorpusNÓS: A massive Galician corpus for training large language models

We present a systematic analysis of the influence of vocabulary size on the performance of Neural Machine Translation (NMT) models, with a particular focus on Galician language pairs (Basque-Galician, Catalan-Galician, and English-Galician). The study explores varying vocabulary sizes using the Byte Pair Encoding (BPE) subword segmentation methodology, with a particular emphasis on BLEU scores. Our results reveal a consistent preference for smaller BPE models, and this preference persists across different scales of training data. The study underscores the importance of vocabulary size in NMT, providing insights for languages with varying data volumes.

Keywords: Corpus, Galician