HPCNLP: High Performance Computing for Natural Language Processing

Natural Language Processing (NLP) is considered one of the methodologies best suited to structuring and organizing the textual information accessible through the Internet. Linguistic processing of large amounts of text is a complex task that requires several subtasks organized in interconnected modules. One of the main problems NLP researchers face is the high computational cost of their tools, together with their poor scalability, which makes them impractical for analyzing large volumes (gigabytes or even terabytes) of documents. High-performance computing (HPC) is therefore indispensable for significantly reducing computational cost and improving system scalability, both of which are crucial when dealing with very large amounts of text. In this project we will apply both parallelization and optimization techniques, making use of Big Data technologies, to linguistic prototypes that perform diverse NLP tasks, with the aim of integrating them into a suite of NLP modules that are both efficient and scalable. The new NLP modules developed in this project will be suitable for use in more complex, higher-level linguistic applications, which will in turn become more efficient. Linguistic engineering applications that can benefit from these modules include, among others, machine translation, information retrieval, question answering, and new intelligent systems for technological surveillance and monitoring.

Objectives

NLP techniques can be broadly divided into two related categories: text analysis and Information Extraction (IE). IE processes often use analyzed text and, conversely, text analysis techniques perform better when they use information previously extracted from the text. In this research project we will apply different parallelization and optimization techniques to three NLP tasks. Two of them are text analysis methods: Named Entity Recognition (NER) and syntactic dependency analysis. The third belongs to the Information Extraction category: relationship extraction.
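
As a rough illustration of what parallelizing a per-document NLP task can look like, the sketch below splits the work into a map step (analyze one document) and a merge step (combine partial results). It is not the project's implementation: `find_candidates` is a hypothetical, deliberately naive stand-in for a real NER module, and the names `merge_counts` and `analyze_corpus` are assumptions made for this example only.

```python
import re
from collections import Counter
from concurrent.futures import ThreadPoolExecutor

# Hypothetical stand-in for a real NER module: any run of capitalized
# words is treated as an entity candidate. Real NER is far more involved.
CANDIDATE = re.compile(r"\b[A-Z][a-z]+(?:\s+[A-Z][a-z]+)*\b")


def find_candidates(document: str) -> Counter:
    """Map step: count entity candidates in a single document."""
    return Counter(CANDIDATE.findall(document))


def merge_counts(partials) -> Counter:
    """Merge step: combine the per-document counts into one total."""
    total = Counter()
    for partial in partials:
        total.update(partial)
    return total


def analyze_corpus(documents, workers: int = 4) -> Counter:
    """Run the map step over all documents concurrently, then merge."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partials = pool.map(find_candidates, documents)
        return merge_counts(partials)


if __name__ == "__main__":
    corpus = ["Alice met Bob in Paris.", "Bob flew to New York."]
    print(analyze_corpus(corpus, workers=2))
```

Because each document is analyzed independently, the map step has no shared state, which is what makes this kind of task straightforward to distribute. The thread pool here only demonstrates the structure; for CPU-bound analysis in Python, a process pool or a cluster framework would presumably be needed to obtain real speedups, with the same map and merge shape.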

Link to the Project Website