Evaluation of explainable AI by medical experts: a survey of the existing approaches
In this survey, we examine the landscape of Explainable Artificial Intelligence (XAI) evaluation in the medical domain, focusing on studies in which XAI techniques were evaluated by medical practitioners. Our analysis identifies the prevailing trends and notable deficiencies in current evaluation methodologies. In particular, we find that important details of evaluation studies, such as the user study interface, study location, and participant remuneration, are often omitted from the final reports. Furthermore, our findings reveal a concerning scarcity of statistical significance testing of evaluation results, which leads to overly optimistic conclusions about the applicability of XAI. We also highlight a prevalent tendency to assess XAI in isolation, without comparative analysis, and an unbalanced emphasis on perception attributes such as usefulness, human-AI performance, and clinical relevance, at the expense of other crucial properties. A further issue identified in our survey is the imprecise formulation of participant questions, resulting in an excessive number of similarly purposed questions. Beyond reporting these findings, we distill a curated set of recommendations covering the main steps to take before, during, and after an evaluation study, aimed at researchers who intend to deploy XAI techniques in real medical applications. These guidelines are designed to support genuine usability evaluation of XAI tools by medical professionals, ensuring a robust and meaningful application of XAI in healthcare.
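To make the comparative, significance-tested study design concrete, the following is a minimal sketch (not taken from the paper) of how a paired comparison between two study conditions could be analyzed. The per-clinician accuracy scores and the condition names are hypothetical; the Wilcoxon signed-rank test from SciPy is one reasonable non-parametric choice for the small paired samples typical of studies with medical experts.

```python
# Minimal sketch of a comparative XAI evaluation analysis
# (AI-only vs. AI + explanation). All data below is hypothetical.
from scipy.stats import wilcoxon

# Per-participant diagnostic accuracy (fraction of cases answered
# correctly) under the two study conditions, one pair per clinician.
accuracy_ai_only = [0.70, 0.65, 0.80, 0.75, 0.60, 0.72, 0.68, 0.74]
accuracy_ai_xai  = [0.78, 0.70, 0.82, 0.74, 0.69, 0.80, 0.71, 0.79]

# Wilcoxon signed-rank test: non-parametric and suited to the small
# paired samples common in studies with medical experts.
stat, p_value = wilcoxon(accuracy_ai_only, accuracy_ai_xai)
print(f"Wilcoxon statistic = {stat:.2f}, p = {p_value:.3f}")

# Reporting the p-value (ideally with an effect size) rather than raw
# averages alone avoids the overly optimistic conclusions the survey
# observes in studies that skip significance testing.
```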
Keywords: Explainable AI, machine learning in medicine
Publication: Congress
November 6, 2025
/research/publications/evaluation-of-explainable-ai-by-medical-experts-a-survey-of-the-existing-approaches
Authors: Nikolay Babakov, Elena Rezgova, Ehud Reiter, Alberto Bugarín