From 'Dr. Google' to ChatGPT: Are the Internet's Answers to Health Questions Reliable?

A study published in the scientific journal 'npj Digital Medicine', part of the 'Nature' group, analyzes how truthful the answers obtained on the Internet to health-related queries are, whether they come from conventional search engines or from Artificial Intelligence tools.

Can we trust the information provided by the most popular search engines (like Google or Yahoo), or even by large language models (like ChatGPT), when we turn to the Internet with health-related queries? Science is looking into it. One of the latest contributions in this field has just been published in the journal npj Digital Medicine, part of the prestigious Nature publishing group. In it, a team of experts in Information Retrieval, Text Mining, and High-Performance Computing at CiTIUS (Centro Singular de Investigación en Tecnologías Inteligentes de la Universidade de Santiago de Compostela) selected a sample of web search engines and artificial intelligence (AI) models to analyze how these systems behave when faced with medical queries posed by the general public.

The study poses a reasonable question in the current context: is it more reliable to look up information about medical symptoms in a traditional search engine, or to do so through a conversational artificial intelligence? "We used to talk about 'Dr. Google'," the authors point out. "Now AIs have joined in: we wanted to know to what extent these tools provide correct medical answers, what types of errors they make, and how we can combine them to get the best out of each one."

Google or ChatGPT: which responds better?

The study evaluated the performance of four traditional search engines (Google, Bing, Yahoo, and DuckDuckGo) and seven conversational AI models, including general-purpose systems such as ChatGPT and LLaMA3, as well as MedLLaMA, a model specifically trained to answer medical questions. The researchers measured the ability of all these technologies to provide correct answers to a battery of real, standardized medical questions.
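
To picture what this kind of evaluation looks like in practice, the following minimal Python sketch runs a battery of questions through a system and scores the answers against a reference. Everything here (the questions, the `ask_system` and `is_correct` helpers) is a hypothetical illustration, not the study's actual code or data:

```python
# Minimal sketch of the kind of accuracy evaluation described above.
# The questions, answers, and functions are hypothetical illustrations,
# not the study's actual code or data.

questions = [
    # (medical question, ground-truth answer according to medical consensus)
    ("Can vitamin C cure the common cold?", "no"),
    ("Does regular exercise help lower blood pressure?", "yes"),
]

def ask_system(question: str) -> str:
    """Stand-in for querying a search engine or a conversational AI."""
    return "no" if "vitamin C" in question else "yes"

def is_correct(answer: str, truth: str) -> bool:
    """Stand-in grading: the study judged answers against medical consensus."""
    return truth in answer.lower()

correct = sum(is_correct(ask_system(q), t) for q, t in questions)
print(f"Accuracy: {correct / len(questions):.0%}")
```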

"Among the most relevant findings of the study," notes Marcos F. Pichel, the first author of the work, "it is observed that traditional search engines offer between 60% and 70% correct answers within the top twenty results, although many of the recovered pages are irrelevant or do not provide clear information to resolve the medical doubt." Regarding the use of AIs, the postdoctoral researcher at CiTIUS recognizes a higher percentage of successes, although he warns that their use is not without risks: "conversational artificial intelligences present a higher accuracy rate, ranging from 80% to 90%, but... they can incur in a characteristic problem of this type of systems: the generation of false answers expressed with great confidence, what we know as hallucinations." The error analysis carried out throughout the study has allowed grouping the errors into three major categories: those that contradict the established medical consensus ("the most concerning"); those that arise from a poor interpretation of the AI of the posed question (generally because it lacks the basic knowledge of how things work in the real world, what humans usually call common sense); and those that lead to overly vague or imprecise answers that, in practice, do not provide real help to those who need it.

Another author, Juan Carlos Pichel, emphasizes the importance of how questions are formulated: "Models are very sensitive to context," he states, noting that a well-designed prompt (the query message) can greatly improve the response. The opposite can also happen: "an ambiguous question generates dangerous answers," says the professor of Computer Architecture and Technology at USC. The study evaluates different levels of context, making it possible to observe how the quality of the response varies with the type of prompt used. "One of the most serious risks we detected in the use of AIs is that, if they do not understand the question well or lack sufficient context, they can offer unsafe advice," he warns. "And the most concerning thing is that they do so with great assertiveness, which can lead to serious errors with direct consequences for people's health." The work stresses that the way a question is formulated has a crucial impact on the quality of the response. "The same AI can go from getting it wrong to getting it right just by reformulating the prompt," concludes Pichel.
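
The idea of "levels of context" can be illustrated with a short sketch: the same medical question wrapped in progressively richer prompts. The templates below are assumptions made for illustration, not the prompts used in the study:

```python
# Sketch of how the same medical question can be asked with increasing
# levels of context. The prompt templates are illustrative assumptions,
# not the ones used in the study.

question = "Is it safe to take ibuprofen if I have high blood pressure?"

prompts = {
    "no context": question,
    "role context": (
        "You are a cautious medical assistant.\n"
        f"Question: {question}"
    ),
    "full context": (
        "You are a cautious medical assistant. Answer only from established "
        "medical consensus, and say 'I don't know' if you are unsure.\n"
        f"Question: {question}\n"
        "Answer yes or no, then briefly justify your answer."
    ),
}

for level, prompt in prompts.items():
    print(f"--- {level} ---\n{prompt}\n")
```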

Search engines vs. AIs: stronger together

For David Losada, professor of Computer Science and Artificial Intelligence, a key part of the work is exploring how to enrich AIs with results obtained by search engines, using retrieval-augmented generation (RAG) techniques. "Injecting web results into the prompt allows lighter AIs, which are less costly to train and therefore more efficient, to reason over external, up-to-date information and generate accurate answers, without needing to have all the information pre-stored in their parameters. It is a very promising strategy for AI-assisted medical systems, as it points to a safe and sustainable path forward," says Losada.
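
As a rough illustration of how RAG works, the following Python sketch retrieves web snippets and injects them into the prompt before asking the model. The `web_search` and `llm` helpers are hypothetical stand-ins, not the study's pipeline or any real API:

```python
# Minimal retrieval-augmented generation (RAG) sketch. `web_search` and
# `llm` are hypothetical stand-ins for a real search API and a real
# language-model client; the study's actual pipeline is not shown here.

def web_search(query: str, k: int = 3) -> list[str]:
    """Stand-in: return the top-k text snippets from a web search engine."""
    return [f"[web snippet {i + 1} about: {query}]" for i in range(k)]

def llm(prompt: str) -> str:
    """Stand-in: send the prompt to a (lightweight) language model."""
    return "(answer grounded in the evidence injected above)"

def answer_with_rag(question: str) -> str:
    # 1. Retrieve external, up-to-date evidence from the web.
    snippets = web_search(question)
    # 2. Inject the retrieved snippets into the prompt, so the model
    #    reasons over them instead of relying only on its parameters.
    evidence = "\n".join(f"- {s}" for s in snippets)
    prompt = (
        "Answer the medical question using ONLY the evidence below. "
        "If the evidence is insufficient, say you don't know.\n"
        f"Evidence:\n{evidence}\n\nQuestion: {question}"
    )
    # 3. Generate the answer conditioned on the injected evidence.
    return llm(prompt)

print(answer_with_rag("Does ibuprofen raise blood pressure?"))
```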

"The results of our work show that conversational AIs tend to offer more precise and focused responses than search engines, but they can also make serious errors," explains David Losada. "The problem with search engines is that they return a lot of irrelevant or ambiguous information. The AI, on the other hand, offers you a single answer, which can be good... or completely wrong."

The study concludes that both search engines and AIs have the potential to provide useful medical information, but they require informed use: "our message is not to choose one over the other, but to learn to use them well and to know when to be skeptical." The authors therefore emphasize the need for education, both for the general public and for healthcare professionals. "It is not about prohibiting or replacing, but about understanding how these technologies work and learning to get the most out of them in a critical and informed way. At best, both search engines and AIs get between 10% and 15% of answers wrong, and in medical matters that margin can be very delicate if errors are not detected in time," warns the team responsible for the work. "Both citizens and healthcare professionals must be aware of the limits and strengths of these technologies. Digital health literacy is key."