Plagiarism detection using software tools: a study in a Computer Science degree

1.1. Background

Plagiarism in projects that undergraduate students submit for assessment is a growing concern for university instructors, because plagiarized work becomes harder to detect in continuously assessed courses, where students hand in a very large number of projects. Plagiarism has particular features in Computer Science and related degrees: (i) students must learn the distinction between plagiarism and software reuse as a basis of professional honesty, (ii) project-based assessment is widespread and cannot rely entirely on individual interviews with students, (iii) implementation-related skills are assessed by reviewing the source code of projects, and (iv) knowledge sharing has to be promoted among students.

1.2. Alternatives

In this paper we describe a study on plagiarism detection in the programming projects of 8 courses of a BSc in Computer Science. A total of 865 projects of different sizes (from 20 to 2000 lines of source code), written in the C and Modula-2 programming languages, were screened using two plagiarism detection software tools that produce an originality report for each project, including a global similarity index (SI). The instructor of each course analysed the reports individually and in detail, which showed that even projects with very high SI values are not necessarily plagiarized. Quantitatively, of the 100 projects that the tools rated at SI > 75%, 26 showed some evidence of plagiarism (3% of all projects). The usual reasons for a high SI in non-plagiarized projects were legitimate reuse of code, the repetitive syntax of programming languages, and the use of common modules for basic tasks that are usually solved in the same way. It therefore became clear that a manual, in-depth, individual analysis of the reports is needed to avoid false positives.
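The figures above can be checked with a short calculation. The following is a minimal sketch in Python; the class and property names are hypothetical and chosen only for illustration, and the numbers are those reported in the study (865 projects screened, 100 rated at SI > 75%, 26 with some evidence of plagiarism).

```python
from dataclasses import dataclass


@dataclass
class ScreeningSummary:
    """Counts from a screening run (hypothetical helper, not part of any tool)."""

    total_projects: int  # all projects screened
    flagged: int         # projects rated above the SI threshold
    confirmed: int       # flagged projects with some evidence of plagiarism

    @property
    def false_positives(self) -> int:
        # Flagged projects that manual review did not confirm.
        return self.flagged - self.confirmed

    @property
    def precision_among_flagged(self) -> float:
        # Share of flagged projects that manual review confirmed.
        return self.confirmed / self.flagged

    @property
    def confirmed_share_of_total(self) -> float:
        # Share of all screened projects that were confirmed.
        return self.confirmed / self.total_projects


# Figures reported in the study.
summary = ScreeningSummary(total_projects=865, flagged=100, confirmed=26)

print(f"false positives among flagged: {summary.false_positives}")
print(f"confirmed among flagged: {summary.precision_among_flagged:.0%}")
print(f"confirmed among all projects: {summary.confirmed_share_of_total:.1%}")
```

Under these figures, 74 of the 100 flagged projects were not confirmed by manual review, which is why a high SI alone is a weak basis for a decision.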
High-quality, usable review facilities (such as highlighting of similar fragments across documents, quick navigation between fragments, and easy access to external sources of potential plagiarism) are a very valuable addition to these tools, since they reduce the time needed for the necessary manual inspection of documents. Users welcome such features.

1.3. Conclusions

The study made clear that plagiarism detection tools need to incorporate additional knowledge when applied to programming projects. This knowledge concerns (i) a description of the course resources and of the minimum acceptable SI threshold (stating the reusable code, ...), (ii) the implicit information that instructors provide when they label a given document as plagiarized or not, and (iii) automated learning mechanisms that refine the detection of plagiarized fragments. Adding these features to a plagiarism detection software tool, together with good integration into the assessment workflow, is key to building a valuable support system for e-learning-based continuous assessment in programming courses.
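As an illustration of point (i), the sketch below shows how declaring the reusable code of a course could lower a similarity score. It is a toy example under stated assumptions: the study does not describe how the tools compute the SI, so a simple Jaccard similarity over normalized source lines stands in for it here, and the function names and the C fragments are hypothetical.

```python
import re


def normalized_lines(source: str) -> set[str]:
    """Return the set of non-empty lines with runs of whitespace collapsed."""
    lines = set()
    for raw in source.splitlines():
        line = re.sub(r"\s+", " ", raw).strip()
        if line:
            lines.add(line)
    return lines


def similarity_index(a: str, b: str, reusable: str = "") -> float:
    """Jaccard similarity (0.0 to 1.0) of the line sets of two sources.

    Lines that also occur in `reusable` (code the instructor declared as
    legitimately reusable) are ignored. This is an assumed stand-in for SI,
    not the measure used by the tools in the study.
    """
    allowed = normalized_lines(reusable)
    lines_a = normalized_lines(a) - allowed
    lines_b = normalized_lines(b) - allowed
    union = lines_a | lines_b
    if not union:
        return 0.0
    return len(lines_a & lines_b) / len(union)


# Hypothetical course template and two student submissions built on it.
template = "#include <stdio.h>\nint main(void) {\n    return 0;\n}\n"
student_a = "#include <stdio.h>\nint main(void) {\n    int total = 0;\n    return 0;\n}\n"
student_b = "#include <stdio.h>\nint main(void) {\n    puts(\"hi\");\n    return 0;\n}\n"

raw_si = similarity_index(student_a, student_b)
adjusted_si = similarity_index(student_a, student_b, reusable=template)

print(f"SI without course knowledge: {raw_si:.0%}")
print(f"SI with reusable code declared: {adjusted_si:.0%}")
```

In this toy case the two submissions share only the template lines, so the score drops from about 67% to 0% once the template is declared as reusable. A line-set comparison is deliberately crude; it ignores ordering and renamed identifiers, which real tools would need to handle.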

keywords: education