Improving the Reliability of Health Information Credibility Assessments
The applicability of retrieval algorithms to real data relies heavily on the quality of the training data. Training and test collections for retrieval systems are typically built from annotations produced by human assessors following a set of guidelines. Some concepts, however, are prone to subjectivity, which can limit the real-world utility of any algorithm developed with the resulting data. One such concept is credibility, an important factor in determining whether retrieved information is accepted. In this paper, we evaluate an existing set of guidelines with respect to their ability to generate reliable credibility judgements across multiple raters. We identify reasons for disagreement and adapt the guidelines into an actionable and traceable annotation scheme that i) yields higher inter-annotator reliability, and ii) can explain why a rater made a specific credibility judgement. We provide promising evidence of the robustness of the new guidelines and conclude that they could be a valuable resource for building future test collections for misinformation detection.
Keywords: Reliability, Credibility assessments, Health-related content