Increasing manually annotated resources for Galician: the Parallel Universal Dependencies Treebank
This paper presents the development of the Parallel Universal Dependencies (PUD) treebank for Galician. PUD treebanks were originally created for the CoNLL 2017 Shared Task on Multilingual Parsing, and have subsequently been used both to develop NLP tools and to perform cross-linguistic analysis using parallel resources. The Galician PUD consists of 1000 sentences manually reviewed by professional translators and aligned with the other 23 available PUD treebanks. The linguistic annotation was first carried out using state-of-the-art NLP tools for Galician and then reviewed by two experts, achieving high inter-annotator agreement. We describe the process of translating, pre-processing, and reviewing the corpus, and discuss the annotation of some linguistic phenomena in comparison with other PUD treebanks. The release of Galician PUD will double the amount of manually reviewed treebank data for this linguistic variety, as only 1000 reviewed sentences were available to date. It will also be useful for carrying out cross-linguistic analyses that include Galician, and as an additional test corpus for machine translation systems.
Keywords: Galician, Syntax, Universal Dependencies, PUD.
Publication: Congress
April 1, 2024
Authors: Xulia Sánchez-Rodríguez, Albina Sarymsakova, Laura Castro, Marcos Garcia