
From 'Dr. Google' to ChatGPT: Are the Internet's Answers to Health Questions Reliable?
A study published in the scientific journal NPJ Digital Medicine, part of the Nature group, analyzes how truthful the answers to health-related questions obtained on the Internet are, whether they come from conventional search engines or from Artificial Intelligence tools.
Can we trust the information returned by the most popular search engines (such as Google or Yahoo), or even by large language models (such as ChatGPT), when we turn to the Internet for answers to health questions? Science is investigating. One of the latest contributions in this field has just been published in the journal NPJ Digital Medicine, part of the prestigious Nature publishing group. In it, a team of experts in Information Retrieval, Text Mining, and High Performance Computing from CiTIUS (Centro Singular de Investigación en Tecnologías Inteligentes de la Universidade de Santiago de Compostela) selected a sample of web search engines and artificial intelligence (AI) models and analyzed how these systems behave when faced with medical queries posed by the general public.
The study addresses a reasonable question in the current context: is it more reliable to look up information on medical symptoms through a traditional search engine, or through a conversational artificial intelligence? "We used to talk about 'Dr. Google,'" the authors point out. "Now the AIs are joining in: we wanted to know to what extent these tools provide correct medical answers, what types of errors they make, and how we can combine them to get the best out of each."
Google or ChatGPT: which responds better?
The study evaluated the performance of four traditional search engines (Google, Bing, Yahoo, and DuckDuckGo) and seven conversational AI models, including general-purpose systems such as ChatGPT and LLaMA3 as well as MedLLaMA, a model specifically trained to answer medical questions. The researchers measured the ability of all these technologies to provide correct answers to a battery of real, standardized medical questions.
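To make the setup concrete, here is a minimal sketch of how such an evaluation could be scored. It is not the authors' actual pipeline: the sample questions, the ground-truth labels, and the query_system() backend are all hypothetical placeholders.

```python
# A minimal sketch (not the authors' actual code) of how systems could be
# scored on yes/no health questions against an expert ground truth.
# The sample questions, labels, and query_system() backend are hypothetical.

QUESTIONS = [
    # (question, expert ground-truth answer)
    ("Does vitamin C cure the common cold?", "no"),
    ("Can regular exercise lower blood pressure?", "yes"),
]

def query_system(system: str, question: str) -> str:
    """Placeholder for calling a search engine or a conversational AI.

    In the study, answers came from the top search results or from the
    model's generated text; here we just return a canned reply.
    """
    return "yes"  # stand-in response

def accuracy(system: str) -> float:
    """Fraction of questions the system answers in line with the experts."""
    correct = sum(
        1
        for question, truth in QUESTIONS
        if query_system(system, question).strip().lower() == truth
    )
    return correct / len(QUESTIONS)

for system in ["search-engine", "general-llm", "medical-llm"]:
    print(f"{system}: {accuracy(system):.0%} correct")
```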
"Among the most relevant findings of the study," points out Marcos F. Pichel, the first author of the work, "it is observed that traditional search engines offer between 60% and 70% correct answers within the first twenty results, although many of the pages retrieved are irrelevant or do not provide clear information to resolve the medical question." As for the use of AIs, the postdoctoral researcher at CiTIUS (a center co-financed by the European Union through the Galicia Feder Program 2021-2027) recognizes a higher success rate, although he warns that its use is not without risks: "Conversational artificial intelligences have a higher success rate, ranging from 80% to 90%, but... they may incur a characteristic problem of these types of systems: the generation of false answers expressed with great confidence, which we know as hallucinations." The error analysis conducted throughout the study has grouped the failures into three main categories: those that contradict established medical consensus ("the most concerning"); those that arise from a misinterpretation by the AI of the question posed (usually because it lacks basic knowledge about how things work in the real world, which humans often call common sense); and those that result in responses that are too vague or imprecise, which, in practice, do not provide real help to those who need it.
Another author, Juan Carlos Pichel, emphasizes the importance of how questions are formulated: "The models are very sensitive to context," he says, noting that a well-designed prompt (query message) can greatly improve the answer. But the opposite can also happen: "an ambiguous question generates dangerous responses," says the professor of Computer Architecture and Technology at USC. The study evaluates different levels of context, making it possible to observe how the quality of the answer varies with the type of prompt used (see the sketch below). "One of the most serious risks we detected in the use of AIs is that, if they do not understand the question well or lack sufficient context, they can offer unsafe advice," he warns. "And the most concerning part is that they do so with great assertiveness, which can lead to fatal errors with direct consequences for people's health." The study reinforces the idea that the way a question is formulated has a crucial impact on the quality of the answer. "The same AI can go from getting it wrong to getting it right simply by rephrasing the prompt," Pichel concludes.
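As an illustration of what "levels of context" can mean in practice, the sketch below wraps the same health question in three increasingly explicit prompts. The wording is hypothetical, not taken from the study.

```python
# Hypothetical prompts (not the study's actual wording) showing how the
# same health question can be asked with increasing levels of context.

QUESTION = "Can aspirin help prevent heart attacks?"

def build_prompt(question: str, context_level: str) -> str:
    """Wrap a health question in progressively more explicit instructions."""
    if context_level == "bare":
        # No framing at all: the model must guess what kind of answer we want.
        return question
    if context_level == "role":
        # Tell the model what the task is and the expected answer format.
        return (
            "You are answering a consumer health question. "
            "Answer 'yes' or 'no' and justify briefly.\n\n" + question
        )
    if context_level == "evidence":
        # Additionally ask the model to stick to medical consensus and
        # to flag uncertainty instead of answering assertively.
        return (
            "You are answering a consumer health question. "
            "Base your answer on established medical consensus, answer "
            "'yes' or 'no', and say so explicitly if the evidence is "
            "uncertain.\n\n" + question
        )
    raise ValueError(f"unknown context level: {context_level}")

for level in ["bare", "role", "evidence"]:
    print(f"--- {level} ---")
    print(build_prompt(QUESTION, level))
    print()
```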
Search engines vs. AIs: stronger together
For David Losada, professor of Computer Science and Artificial Intelligence, a key part of the work is exploring how to enrich AIs with results obtained by search engines, using Retrieval-Augmented Generation (RAG) techniques. "Injecting web results into the prompt allows lighter AIs, which are less costly to train and therefore more efficient, to reason over external, up-to-date information to generate accurate answers, without needing to have all the information pre-stored in their parameters. It is a very promising strategy for AI-assisted medical systems because it points toward a safe and sustainable future," says Losada.
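The sketch below shows the general shape of such a RAG loop: retrieved snippets are injected into the prompt before the model generates its answer. The web_search() and ask_llm() functions are hypothetical stand-ins, not the authors' system or any real library API.

```python
# A minimal RAG sketch. web_search() and ask_llm() are hypothetical
# placeholders, not the authors' pipeline or any real library API.

def web_search(query: str, k: int = 3) -> list[str]:
    """Stand-in for a search-engine call returning text snippets."""
    return [f"snippet {i + 1} about: {query}" for i in range(k)]

def ask_llm(prompt: str) -> str:
    """Stand-in for a call to a (possibly lightweight) language model."""
    return "Answer grounded in the evidence above."

def rag_answer(question: str) -> str:
    """Inject retrieved web snippets into the prompt before generation,
    so the model reasons over external, current information instead of
    relying only on what is stored in its parameters."""
    snippets = web_search(question)
    context = "\n".join(f"- {s}" for s in snippets)
    prompt = (
        "Using only the evidence below, answer the health question.\n"
        f"Evidence:\n{context}\n\n"
        f"Question: {question}\nAnswer:"
    )
    return ask_llm(prompt)

print(rag_answer("Can regular exercise lower blood pressure?"))
```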
"The results of our work show that conversational AIs often offer more precise and focused responses than search engines, but they can also make serious errors," explains David Losada. "The problem with search engines is that they return a lot of irrelevant or ambiguous information. The AI, on the other hand, offers you a single response, which can be good... or completely wrong."
The study concludes that both search engines and AIs have the potential to provide useful medical information, but that they require informed use: "our message is not to choose between one or the other, but to learn to use them well and to know when to be skeptical." The authors therefore emphasize the need for training, both for the general public and for healthcare professionals. "It is not about banning or replacing these technologies, but about understanding how they work and learning to make smart, informed use of them. At best, both search engines and AIs get between 10% and 15% of answers wrong, and in medical matters that margin can be very delicate if not detected in time," warns the team responsible for the work. "Both citizens and healthcare professionals must be aware of the limits and strengths of these technologies. Digital health literacy is key."