Reducing, Reusing and Recycling large models for developing Responsible and Green Language Technologies

Language is the most efficient tool used by humans to transmit information. Most of the available digital information consists of unstructured data in the form of documents in multiple languages, which poses a challenge for any organisation that wants to exploit and process this information. Natural language processing (NLP), which includes natural language understanding (NLU) and natural language generation (NLG), is one of the main challenges of artificial intelligence and has a fast-growing economic impact on today's digital transformation. NLP is at the heart of software that processes information and exploits the vast amount of data contained on the web, social networks, etc. Despite their impressive capabilities, pre-trained language models present serious problems from research, environmental and ethical perspectives. The main research goal of the DeepR3 project is to advance the state of the art of Deep Learning (DL) technology for NLU and NLG by (i) developing efficient methods to extend existing models for the official languages of Spain (Spanish, Catalan, Basque and Galician) and English to new domains, genres and languages; (ii) exploring novel ways of pre-training and tuning language models efficiently, thus reducing the carbon footprint associated with training such models; (iii) addressing NLU tasks through text generation; (iv) addressing the explainability of DL-based language models through NLG tasks; (v) developing efficient techniques that reuse and recycle pre-trained models for machine translation (MT); (vi) applying the developed techniques to improve the state of the art in NLP; (vii) developing new evaluation datasets to analyse progress towards responsible NLP; (viii) generating scientific interest in the project by organising international evaluation competitions; and (ix) developing a series of advanced content-based domain applications for the project languages, across multiple sectors and domains.

The specific objectives of this subproject are to: (i) define, and verify compliance with, guidelines and requirements for the development of responsible NLP from an ELSEC (Ethical, Legal, Socio-Economic and Cultural) perspective; (ii) define new metrics for the intrinsic and extrinsic evaluation of NLP tasks; (iii) design a set of experiments and data to evaluate the linguistic capabilities of language models; and (iv) design, implement and validate DL-based NLG systems for meteorology and health that will reuse data, corpora, know-how and pre-trained models for weather forecasting operations, air quality index information, and cardiovascular and neurodegenerative diseases. The main challenge is the generation of reports and alerts, with emphasis on explainability and evaluation. Monolingual models (enriched with MT) and multilingual models will be evaluated by experts (meteorologists or medical staff) and by end users, not only in Galician but also in English, Spanish, Catalan and Basque, in collaboration with the other subprojects. CiTIUS-USC will lead a Panel on responsible NLP (WP1), WP5 (Evaluation) and WP6 (Applications and Use Cases), and will actively participate in the rest of the WPs.

This project DeepR3 (TED2021-130295B-C33) has received funding from MCIN/AEI/10.13039/501100011033 and the European Union “NextGenerationEU”/PRTR.