Exploring the effects of vocabulary size in neural machine translation: Galician as a target language
We present a systematic analysis of the influence of vocabulary size on the performance of Neural Machine Translation (NMT) models, with a particular focus on translation into Galician (Basque-Galician, Catalan-Galician, and English-Galician). The study explores varying vocabulary sizes using the Byte Pair Encoding (BPE) subword segmentation method, with performance evaluated primarily through BLEU scores. Our results reveal a consistent preference for smaller BPE vocabularies, and this preference persists across different scales of training data. The study underscores the importance of vocabulary size in NMT, providing insights for languages with varying data volumes.
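The abstract refers to BPE subword segmentation, where the vocabulary size is controlled by the number of merge operations learned from the training corpus. As a minimal illustrative sketch (a toy pure-Python implementation on an invented four-word Galician corpus, not the authors' actual pipeline, which would typically use a library such as subword-nmt or SentencePiece), the core merge-learning loop looks like this:

```python
from collections import Counter

def learn_bpe(corpus, num_merges):
    """Learn BPE merge operations; num_merges controls the final vocabulary size."""
    # Represent each word as a tuple of characters plus an end-of-word marker.
    vocab = Counter()
    for word in corpus.split():
        vocab[tuple(word) + ("</w>",)] += 1
    merges = []
    for _ in range(num_merges):
        # Count adjacent symbol pairs, weighted by word frequency.
        pairs = Counter()
        for symbols, freq in vocab.items():
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        best = max(pairs, key=pairs.get)
        merges.append(best)
        # Replace every occurrence of the best pair with the merged symbol.
        new_vocab = Counter()
        for symbols, freq in vocab.items():
            out, i = [], 0
            while i < len(symbols):
                if i < len(symbols) - 1 and (symbols[i], symbols[i + 1]) == best:
                    out.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    out.append(symbols[i])
                    i += 1
            new_vocab[tuple(out)] += freq
        vocab = new_vocab
    return merges

# Toy example: frequent character pairs such as "ix" are merged first.
merges = learn_bpe("baixo baixa caixa peixe", 3)
print(merges)
```

Fewer merges yield a smaller vocabulary of shorter subword units; the paper's finding is that such smaller BPE vocabularies tend to translate into Galician better across data scales.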
Publication: Congress
January 14, 2025
Daniel Bardanca Outeirinho, Pablo Gamallo, Iria de-Dios-Flores, José Ramom Pichel Campos