An Empirical Study on the Number of Items in Human Evaluation of Automatically Generated Texts

Human evaluation of neural models in the research field of Natural Language Generation (NLG) requires careful experimental design (e.g., number of evaluators, number of items to assess, number of quality criteria, etc.), both for the sake of reproducibility and to ensure that significant conclusions can be drawn. Although some generic recommendations exist on how to proceed, there is no universally accepted evaluation protocol. In this paper, we empirically study the impact of the number of items to assess in the context of the human evaluation of NLG systems. We apply different resampling methods to simulate the evaluation of different sets of items by each evaluator. We then compare the results obtained by evaluating only a limited set of items with those obtained by evaluating all outputs of the system for a given test set. The empirical findings validate the initial research hypothesis: well-known resampling statistical methods can help obtain statistically significant results even when each evaluator assesses only a small number of items.
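The comparison described above can be sketched with a simple bootstrap: draw many subsets of k items from the full set of rated outputs, and compare the resampled mean (and its confidence interval) to the mean over the full test set. This is only a minimal illustration; the ratings, the subset size k, and the number of resamples below are hypothetical and not taken from the study.

```python
import random
import statistics

# Hypothetical ratings of all system outputs on a 1-5 quality scale.
# In the paper's terms, this stands in for the full test set.
random.seed(0)
full_ratings = [random.choice([2, 3, 3, 4, 4, 5]) for _ in range(300)]
full_mean = statistics.mean(full_ratings)

def bootstrap_mean_ci(ratings, k, n_resamples=2000, alpha=0.05):
    """Estimate the mean rating from subsets of k items via bootstrap,
    returning the resampled mean and a (1 - alpha) percentile interval."""
    means = sorted(
        statistics.mean(random.choices(ratings, k=k))  # sample k items with replacement
        for _ in range(n_resamples)
    )
    lo = means[int((alpha / 2) * n_resamples)]
    hi = means[int((1 - alpha / 2) * n_resamples) - 1]
    return statistics.mean(means), (lo, hi)

# Simulate each evaluator rating only k=30 of the 300 items.
boot_mean, (lo, hi) = bootstrap_mean_ci(full_ratings, k=30)
print(f"full-set mean: {full_mean:.2f}")
print(f"bootstrap mean (k=30): {boot_mean:.2f}, 95% CI: [{lo:.2f}, {hi:.2f}]")
```

If the confidence interval from the small subsets covers the full-set mean, the reduced evaluation is consistent with evaluating every output, which is the kind of agreement the study tests for.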

Keywords: Natural Language Generation, Human evaluation, Resampling methods