An Exploration of the Linguistic Knowledge for Semantic Relation Extraction in Spanish

A common strategy for Question Answering systems is to use high-quality ontologies or databases to answer questions efficiently. Some approaches to building or enriching these databases rely on machine learning classifiers that obtain semantically related terms from unstructured text. These classifiers are based on features that may encode several kinds of linguistic knowledge, from orthographic or lexical information to more complex features such as PoS tags, syntactic dependencies or semantic information. In this paper we select four main types of linguistic features and systematically evaluate their performance on semantic relation extraction. Although combining some types of linguistic features improves the F-score of the classifiers, we observe that, by adjusting the positive/negative ratio of the training examples, we can build high-quality classifiers with just a single type of linguistic feature, based on generic lexico-syntactic patterns. Experiments were carried out on the Spanish version of Wikipedia.

keywords: