Large Language Models for Binary Health-Related Question Answering: A Zero- and Few-Shot Evaluation

In this research, we investigate the effectiveness of Large Language Models (LLMs) in answering health-related questions. The rapid growth and adoption of LLMs, such as ChatGPT, have raised concerns about their accuracy and robustness in critical domains such as health care and medicine. We conduct a comprehensive study comparing multiple LLMs, including recent models such as GPT-4 and Llama2, on a range of binary health-related questions. Our evaluation considers various context and prompt conditions, with the objective of determining the impact of these factors on the quality of the responses. Additionally, we explore the effect of in-context examples on the performance of the top models. To further validate the results, we also conduct contamination experiments that estimate the possibility that the models ingested the benchmarks during their massive training process. Finally, we analyse the main classes of errors made by these models when prompted with health questions. Our findings contribute to understanding the capabilities and limitations of LLMs for health information seeking.

Keywords: Binary question answering, Health, Large Language Models