A comparison of code similarity analysers

Citations: 105
Readers: 152 (Mendeley users who have this article in their library)

This article is free to access.

Abstract

Copying and pasting of source code is a common activity in software engineering. Often, the code is not copied as-is; it may be modified for various purposes, e.g. refactoring, bug fixing, or even software plagiarism. These code modifications can affect the performance of code similarity analysers, including code clone and plagiarism detectors, to a certain degree. We are interested in two types of code modification in this study: pervasive modifications, i.e. transformations that may have a global effect, and local modifications, i.e. code changes that are contained in a single method or code block. We evaluate 30 code similarity detection techniques and tools using five experimental scenarios for Java source code: (1) pervasively modified code, created with tools for source code and bytecode obfuscation, and boiler-plate code, (2) source code normalisation through compilation and decompilation using different decompilers, (3) reuse of optimal configurations over different data sets, (4) tool evaluation using rank-based measures, and (5) local + global code modifications. Our experimental results show that in the presence of pervasive modifications, some of the general textual similarity measures can offer performance similar to that of specialised code similarity tools, whilst in the presence of boiler-plate code, highly specialised source code similarity detection techniques and tools outperform textual similarity measures. Our study strongly validates the use of compilation/decompilation as a normalisation technique: its use reduced false classifications to zero for three of the tools. Moreover, we demonstrate that optimal configurations are highly sensitive to the specific data set: tools configured optimally for one data set perform poorly when those configurations are applied directly to another. The code similarity analysers are thoroughly evaluated not only on several well-known pair-based and query-based error measures but also on each specific type of pervasive code modification. This broad, thorough study is the largest of its kind and potentially an invaluable guide for future users of similarity detection in source code.
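The "general textual similarity measures" mentioned in the abstract include compression-based metrics such as the normalised compression distance (NCD) from the "Clustering by compression" work listed under References, which scores two strings by how much better they compress together than apart. The sketch below is a minimal illustration of that idea, not one of the paper's 30 evaluated tools; the class name Ncd and the choice of DEFLATE as the compressor are assumptions made for illustration.

import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;
import java.util.zip.Deflater;

/**
 * Normalised compression distance between two strings, approximating
 * Kolmogorov complexity with the DEFLATE-compressed size:
 * NCD(x, y) = (C(xy) - min(C(x), C(y))) / max(C(x), C(y)).
 */
public final class Ncd {

    /** Size in bytes of the DEFLATE-compressed input. */
    private static int compressedSize(byte[] input) {
        Deflater deflater = new Deflater(Deflater.BEST_COMPRESSION);
        deflater.setInput(input);
        deflater.finish();
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buffer = new byte[4096];
        while (!deflater.finished()) {
            out.write(buffer, 0, deflater.deflate(buffer));
        }
        deflater.end();
        return out.size();
    }

    /** Returns a value in roughly [0, 1]; smaller means more similar. */
    public static double distance(String x, String y) {
        int cx = compressedSize(x.getBytes(StandardCharsets.UTF_8));
        int cy = compressedSize(y.getBytes(StandardCharsets.UTF_8));
        int cxy = compressedSize((x + y).getBytes(StandardCharsets.UTF_8));
        return (double) (cxy - Math.min(cx, cy)) / Math.max(cx, cy);
    }

    public static void main(String[] args) {
        // Two renamed-but-equivalent fragments should score near 0.
        String a = "int add(int a, int b) { return a + b; }";
        String b = "int sum(int x, int y) { return x + y; }";
        System.out.printf("NCD = %.3f%n", distance(a, b));
    }
}

Because real compressors only approximate Kolmogorov complexity, NCD is an approximation rather than a true metric, which is one reason studies like this one compare it empirically against specialised clone and plagiarism detectors.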

References (top cited, via Scopus)

CCFinder: A multilinguistic token-based code clone detection system for large scale source code (1334 citations)

Clustering by compression (862 citations)

DECKARD: Scalable and accurate tree-based detection of code clones (848 citations)

Cited by (top cited, via Scopus)

A Systematic Review on Code Clone Detection (112 citations)

Siamese: scalable and incremental code clone search via multiple code representations (53 citations)

Toxic Code Snippets on Stack Overflow (50 citations)


Citation (APA)

Ragkhitwetsagul, C., Krinke, J., & Clark, D. (2018). A comparison of code similarity analysers. Empirical Software Engineering, 23(4), 2464–2519. https://doi.org/10.1007/s10664-017-9564-7

Readers' Seniority

PhD / Post grad / Masters / Doc: 64 (74%)
Researcher: 10 (11%)
Professor / Associate Prof.: 7 (8%)
Lecturer / Post doc: 6 (7%)

Readers' Discipline

Computer Science: 86 (89%)
Engineering: 6 (6%)
Social Sciences: 3 (3%)
Agricultural and Biological Sciences: 2 (2%)

Article Metrics

Mentions (references): 1
Social media (shares, likes & comments): 38
