A comparison of code similarity analysers

Citations: 105
Readers: 152 (Mendeley users who have this article in their library)

This article is free to access.

Abstract

Copying and pasting of source code is a common activity in software engineering. Often, the code is not copied as-is; it may be modified for various purposes, e.g. refactoring, bug fixing, or even software plagiarism. These code modifications can affect the performance of code similarity analysers, including code clone and plagiarism detectors, to a certain degree. We are interested in two types of code modification in this study: pervasive modifications, i.e. transformations that may have a global effect, and local modifications, i.e. code changes that are contained in a single method or code block. We evaluate 30 code similarity detection techniques and tools using five experimental scenarios for Java source code: (1) pervasively modified code, created with tools for source code and bytecode obfuscation, and boiler-plate code, (2) source code normalisation through compilation and decompilation using different decompilers, (3) reuse of optimal configurations over different data sets, (4) tool evaluation using rank-based measures, and (5) local + global code modifications. Our experimental results show that in the presence of pervasive modifications, some of the general textual similarity measures can offer performance similar to that of specialised code similarity tools, whilst in the presence of boiler-plate code, highly specialised source code similarity detection techniques and tools outperform textual similarity measures. Our study strongly validates the use of compilation/decompilation as a normalisation technique: its use reduced false classifications to zero for three of the tools. Moreover, we demonstrate that optimal configurations are highly sensitive to the specific data set: tools configured optimally for one data set perform poorly when those configurations are applied directly to another. The code similarity analysers are thoroughly evaluated not only on several well-known pair-based and query-based error measures but also on each specific type of pervasive code modification. This broad, thorough study is the largest of its kind and potentially an invaluable guide for future users of similarity detection in source code.
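The "general textual similarity measures" mentioned in the abstract include compression-based metrics such as the normalised compression distance (NCD) from the "Clustering by compression" work listed under References, which scores two strings by how much better they compress together than apart. The sketch below is a minimal illustration of that idea, not one of the paper's 30 evaluated tools; the class name Ncd and the choice of DEFLATE as the compressor are assumptions made for illustration.

import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;
import java.util.zip.Deflater;

/**
 * Normalised compression distance between two strings, approximating
 * Kolmogorov complexity with the DEFLATE-compressed size:
 * NCD(x, y) = (C(xy) - min(C(x), C(y))) / max(C(x), C(y)).
 */
public final class Ncd {

    /** Size in bytes of the DEFLATE-compressed input. */
    private static int compressedSize(byte[] input) {
        Deflater deflater = new Deflater(Deflater.BEST_COMPRESSION);
        deflater.setInput(input);
        deflater.finish();
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buffer = new byte[4096];
        while (!deflater.finished()) {
            out.write(buffer, 0, deflater.deflate(buffer));
        }
        deflater.end();
        return out.size();
    }

    /** Returns a value in roughly [0, 1]; smaller means more similar. */
    public static double distance(String x, String y) {
        int cx = compressedSize(x.getBytes(StandardCharsets.UTF_8));
        int cy = compressedSize(y.getBytes(StandardCharsets.UTF_8));
        int cxy = compressedSize((x + y).getBytes(StandardCharsets.UTF_8));
        return (double) (cxy - Math.min(cx, cy)) / Math.max(cx, cy);
    }

    public static void main(String[] args) {
        // Two renamed-but-equivalent fragments should score near 0.
        String a = "int add(int a, int b) { return a + b; }";
        String b = "int sum(int x, int y) { return x + y; }";
        System.out.printf("NCD = %.3f%n", distance(a, b));
    }
}

Because real compressors only approximate Kolmogorov complexity, NCD is an approximation rather than a true metric, which is one reason studies like this one compare it empirically against specialised clone and plagiarism detectors.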

References (top cited, via Scopus)

CCFinder: A multilinguistic token-based code clone detection system for large scale source code (1334 citations)

Clustering by compression (862 citations)

DECKARD: Scalable and accurate tree-based detection of code clones (848 citations)

Cited by (top cited, via Scopus)

A Systematic Review on Code Clone Detection (112 citations)

Siamese: scalable and incremental code clone search via multiple code representations (53 citations)

Toxic Code Snippets on Stack Overflow (50 citations)


Citation (APA)

Ragkhitwetsagul, C., Krinke, J., & Clark, D. (2018). A comparison of code similarity analysers. Empirical Software Engineering, 23(4), 2464–2519. https://doi.org/10.1007/s10664-017-9564-7

Readers' Seniority

PhD / Post grad / Masters / Doc: 64 (74%)
Researcher: 10 (11%)
Professor / Associate Prof.: 7 (8%)
Lecturer / Post doc: 6 (7%)

Readers' Discipline

Computer Science: 86 (89%)
Engineering: 6 (6%)
Social Sciences: 3 (3%)
Agricultural and Biological Sciences: 2 (2%)

Article Metrics

Mentions (references): 1
Social media (shares, likes & comments): 38
