Analysing Agreement Among Different Evaluators in God Class and Feature Envy Detection

The automatic detection of Design Smells has evolved in parallel to the evolution of automatic refactoring tools. There was a huge rise in research activity regarding Design Smell detection from 2010 to the present. However, it should be noted that the adoption of Design Smell detection in real software development practice is not comparable to the adoption of automatic refactoring tools. On the basis of the assumption that it is the objectiveness of a refactoring operation as opposed to the subjectivity in definition and identification of Design Smells that makes the difference, in this paper, the lack of agreement between different evaluators when detecting Design Smells is empirically studied. To do so, a series of experiments and studies were designed and conducted to analyse the concordance in Design Smell detection of different persons and tools, including a comparison between them. This work focuses on two well known Design Smells : God Class and Feature Envy . Concordance analysis is based on the Kappa statistic for inter-rater agreement (particularly Kappa-Fleiss ). The results obtained show that there is no agreement in detection in general, and, in those cases where a certain agreement appears, it is considered to be a fair or poor degree of agreement, according to a Kappa-Fleiss interpretation scale. This seems to confirm that there is a subjective component which makes the raters evaluate the presence of Design Smells differently. The study also raises the question of a lack of training and experience regarding Design Smells .

keywords: Design smell, survey, empirical study, experiment, inter-rater agreement, Kappa-Fleiss