
Lecture: 'High Quality Human Evaluation of Generated Texts'
Evaluating the quality of texts generated by modern large language models is difficult. In this talk, I will discuss some general evaluation challenges and then focus specifically on the role of human evaluations. Human evaluations are the best way to assess the more subtle aspects of LLMs, such as task appropriateness and real-world impact, but only if they are conducted rigorously. I will summarise our work on replicating and identifying weaknesses in existing human evaluations and on designing improved evaluation protocols. I'll conclude with advice on conducting high-quality human evaluations.
About the speaker
Ehud Reiter is a Professor of Computing Science at the University of Aberdeen and was formerly Chief Scientist of Arria NLG (a spinout he co-founded). He has been working on Natural Language Generation (NLG) for 35 years and in recent years has focused on the evaluation of language generation; he also has a longstanding interest in healthcare applications. He is one of the most cited and respected researchers in NLG, and his awards include an INLG Test of Time award for his work on data-to-text. He writes a widely read blog on NLG and evaluation (ehudreiter.com).
On-site event