LingUMT: Linguistically Motivated Strategies for Unsupervised Machine Translation
Unsupervised machine translation (UMT) is a new paradigm in machine translation (MT) that relies only on monolingual data. This contrasts with the more widespread approaches to MT, which require parallel data to train models effectively. Although large parallel corpora exist for some language pairs, these resources are scarce for many languages and domains. Monolingual data, in contrast, are far more abundant, cover a wider range of domains and genres, and are constantly growing, even for languages with fewer linguistic resources. Exploring the use of monolingual corpora for machine translation is also beneficial for two further reasons: first, monolingual corpora reduce the biases introduced by literal translation, and second, using them as a translation source allows for a more faithful simulation of the so-called translation competence of bilingual individuals.
Objectives
This project focuses on the exploration, definition and implementation of linguistically motivated, unsupervised machine translation strategies that bring machine translation closer to the paraphrasing process carried out in a bilingual context. Translation is thus conceived of as a semantic process of paraphrasing between two linguistic codes. The project will deal with concepts such as dependency syntax, contextualized meaning, distributional similarity, and syntactic-semantic constructions. In particular, we will analyze the impact of semantic (non-)compositionality on the computational representation of meaning, both in monolingual contexts and in translation between two linguistic varieties.
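As a purely illustrative aside, one common way to operationalize semantic (non-)compositionality with distributional similarity is to compare the vector of a multiword expression, learned as a single unit from monolingual corpora, with a composition of its component word vectors. The following minimal sketch (not the project's actual method; the vectors, names and the averaging operator are assumptions for illustration) shows this idea with toy embeddings:

```python
# Minimal sketch, assuming pre-trained dense embeddings are available.
# Compares the observed vector of a multiword expression with the average of
# its component word vectors; a low cosine suggests a non-compositional
# (idiomatic) reading that may require its own translation unit.
import numpy as np

def cosine(u: np.ndarray, v: np.ndarray) -> float:
    """Cosine similarity between two dense vectors."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

def compositionality_score(phrase_vec: np.ndarray, word_vecs: list[np.ndarray]) -> float:
    """Similarity between the observed phrase vector and the averaged word vectors."""
    composed = np.mean(word_vecs, axis=0)
    return cosine(phrase_vec, composed)

# Toy random vectors stand in for embeddings learned from monolingual corpora.
rng = np.random.default_rng(0)
v_kick, v_bucket = rng.normal(size=300), rng.normal(size=300)
v_kick_the_bucket = rng.normal(size=300)  # idiom vector, learned as a single token

score = compositionality_score(v_kick_the_bucket, [v_kick, v_bucket])
print(f"compositionality ~ {score:.2f}  (low values hint at an idiomatic reading)")
```

In practice the composition function and the threshold separating compositional from non-compositional expressions are design choices; the sketch only illustrates the distributional-similarity comparison itself.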
Project
/research/projects/estratexias-linguisticamente-motivadas-para-a-traducion-automatica-non-supervisionada
PID2021-128811OA-I00 - Marcos García González - Pablo Gamallo Otero, Iria De Dios Flores, José Ramón Pichel Campos