In recent years, with the emergence of neural networks and word embeddings, there has been growing interest in cross-lingual distributional models learned from monolingual corpora to induce bilingual lexicons. However, interest in these models predates the rise of deep learning. In this article, we study the differences between recent strategies, which are based on the alignment of models, and earlier methods, which rely on bilingual anchors to align the texts themselves. We also analyze the impact of including different levels of linguistic knowledge (e.g., lemmatization, PoS tagging, syntactic dependencies) in the process of building cross-lingual models for English and Spanish. Our experiments show that syntactic information benefits traditional models based on text alignment but harms mapped cross-lingual embeddings.
Keywords: Cross-Lingual Embeddings, Monolingual Corpora, Information Extraction, Natural Language Processing
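To make the "alignment of models" strategy mentioned above concrete, the following is a minimal sketch of the standard mapping approach for cross-lingual embeddings: learning an orthogonal transformation between two monolingual vector spaces via the closed-form Procrustes solution. The data here is synthetic and the dimensions are arbitrary; this is an illustration of the general technique, not the exact setup used in the article's experiments.

```python
import numpy as np

# Toy monolingual embedding matrices for a small seed dictionary
# (rows are word vectors; all values are synthetic, for illustration only).
rng = np.random.default_rng(0)
X = rng.normal(size=(5, 4))                         # source-language (e.g. English) vectors
W_true = np.linalg.qr(rng.normal(size=(4, 4)))[0]   # hidden orthogonal map
Y = X @ W_true                                      # target-language (e.g. Spanish) vectors

# Orthogonal Procrustes: W = argmin ||XW - Y||_F subject to W^T W = I,
# solved in closed form from the SVD of X^T Y.
U, _, Vt = np.linalg.svd(X.T @ Y)
W = U @ Vt

# With noise-free data, the learned map recovers the alignment exactly.
print(np.allclose(X @ W, Y))  # → True
```

In practice, X and Y would hold embeddings of translation pairs from a seed bilingual dictionary, and the learned W would then map the full source vocabulary into the target space, where nearest neighbors induce new lexicon entries.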