DepreSym: A Depression Symptom Annotated Corpus and the Role of Large Language Models as Assessors of Psychological Markers
Computational methods for depression detection aim to mine traces of depression from online publications posted by Internet users. However, solutions trained on existing collections exhibit limited generalisation and interpretability. To tackle these issues, recent studies have shown that identifying specific depressive symptoms can lead to more robust and effective models. The eRisk initiative fosters research in this area and has recently proposed a new ranking task focused on developing search methods that find sentences related to depressive symptoms. This search challenge relies on the symptoms specified by the Beck Depression Inventory-II (BDI-II), a questionnaire widely used in clinical practice that covers symptoms such as sadness, irritability or lack of sleep. Given the rankings submitted by the systems participating in eRisk, we first apply top-k pooling over the systems’ relevance rankings to obtain a diverse set of sentences. These sentences are judged for relevance, leading to DepreSym, a dataset of 21,580 sentences annotated according to their relevance to the 21 BDI-II symptoms. This dataset serves as a valuable resource for advancing the development of models that monitor depression markers. Given the complexity of this relevance annotation task, we designed a robust assessment methodology carried out by three expert assessors, including a trained psychologist. As part of this study, we explore the potential of recent Large Language Models (ChatGPT, GPT-4 and Vicuna) as assessors for this complex task. We undertake a comprehensive examination of the LLMs’ performance, studying their main limitations and analysing their role as a complement to, or replacement for, human annotators. Finally, we incorporate our dataset into the Benchmarking Information Retrieval (BEIR) framework for a thorough search evaluation. We use state-of-the-art retrieval systems, including lexical, sparse, dense and re-ranking architectures, to gain insights into the dataset’s complexity and identify potential avenues for improvement.
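For concreteness, the pooling step mentioned above can be illustrated with a minimal sketch. The run layout, sentence identifiers and the value of k below are illustrative assumptions, not the exact eRisk configuration.

```python
from collections import defaultdict

def topk_pool(runs: dict[str, dict[str, list[str]]], k: int = 50) -> dict[str, set[str]]:
    """Top-k pooling: for each query (BDI-II symptom), collect the union of
    the k highest-ranked sentence ids across all participating systems."""
    pool: dict[str, set[str]] = defaultdict(set)
    for ranking in runs.values():                      # one run per participating system
        for query_id, ranked_ids in ranking.items():
            pool[query_id].update(ranked_ids[:k])      # keep only this run's top-k
    return dict(pool)

# Hypothetical runs from two systems for the symptom "sadness"
runs = {
    "system_A": {"sadness": ["s17", "s3", "s42", "s8"]},
    "system_B": {"sadness": ["s3", "s99", "s17", "s21"]},
}
print(topk_pool(runs, k=2))   # {'sadness': {'s17', 's3', 's99'}}
```

The pooled sentences are then passed to the assessors for relevance judgement.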
Keywords: Depression, Social media mining, Large language models, Search, Information retrieval
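Because the dataset is packaged in the BEIR format, a typical evaluation run can be sketched as follows. The data folder name and the dense bi-encoder checkpoint are assumptions for illustration; they are not necessarily the systems evaluated in the paper.

```python
from beir.datasets.data_loader import GenericDataLoader
from beir.retrieval import models
from beir.retrieval.evaluation import EvaluateRetrieval
from beir.retrieval.search.dense import DenseRetrievalExactSearch as DRES

# Load a BEIR-formatted dataset (corpus.jsonl, queries.jsonl, qrels/test.tsv).
# "depresym" is a placeholder folder name for the released dataset.
corpus, queries, qrels = GenericDataLoader(data_folder="depresym").load(split="test")

# Dense retrieval with a SentenceTransformer bi-encoder (model choice is illustrative).
model = DRES(models.SentenceBERT("msmarco-distilbert-base-v3"), batch_size=128)
retriever = EvaluateRetrieval(model, score_function="cos_sim")

results = retriever.retrieve(corpus, queries)
ndcg, _map, recall, precision = retriever.evaluate(qrels, results, retriever.k_values)
print(ndcg)
```

Lexical (e.g. BM25), sparse and re-ranking systems can be evaluated in the same way by swapping the retrieval model while keeping the loader and evaluation code unchanged.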