Scalable detection of semantic clones

by Mark Gabel, Lingxiao Jiang, Zhendong Su
Proceedings of the 30th International Conference on Software Engineering, ICSE '08 (2008)

Abstract

Several techniques have been developed for identifying similar code fragments in programs. These similar fragments, referred to as code clones, can be used to identify redundant code, locate bugs, or gain insight into program design. Existing scalable approaches to clone detection are limited to finding program fragments that are similar only in their contiguous syntax. Other, semantics-based approaches are more resilient to differences in syntax, such as reordered statements, related statements interleaved with other unrelated statements, or the use of semantically equivalent control structures. However, none of these techniques have scaled to real world code bases. These approaches capture semantic information from Program Dependence Graphs (PDGs), program representations that encode data and control dependencies between statements and predicates. Our definition of a code clone is also based on this representation: we consider program fragments with isomorphic PDGs to be clones. In this paper, we present the first scalable clone detection algorithm based on this definition of semantic clones. Our insight is the reduction of the difficult graph similarity problem to a simpler tree similarity problem by mapping carefully selected PDG subgraphs to their related structured syntax. We efficiently solve the tree similarity problem to create a scalable analysis. We have implemented this algorithm in a practical tool and performed evaluations on several million-line open source projects, including the Linux kernel. Compared with previous approaches, our tool locates significantly more clones, which are often more semantically interesting than simple copied and pasted code fragments.

Scalable Detection of Semantic Clones∗
Mark Gabel, Lingxiao Jiang, Zhendong Su
Department of Computer Science
University of California, Davis
{mggabel,lxjiang,su}@ucdavis.edu
Categories and Subject Descriptors
D.2.7 [Software Engineering]: Distribution, Maintenance, and Enhancement—restructuring, reverse engineering, and reengineering
∗This research was supported in part by NSF CAREER Grant No.
0546844, NSF CyberTrust Grant No. 0627749, NSF CCF Grant
No. 0702622, US Air Force under grant FA9550-07-1-0532, and a
generous gift from Intel. The information presented here does not
necessarily reflect the position or the policy of the Government and
no official endorsement should be inferred.
Permission to make digital or hard copies of all or part of this work for
personal or classroom use is granted without fee provided that copies are
not made or distributed for profit or commercial advantage and that copies
bear this notice and the full citation on the first page. To copy otherwise, to
republish, to post on servers or to redistribute to lists, requires prior specific
permission and/or a fee.
ICSE’08, May 10–18, 2008, Leipzig, Germany.
Copyright 2008 ACM 978-1-60558-079-1/08/05 ...$5.00.
General Terms
Languages, Algorithms, Experimentation
Keywords
Clone detection, program dependence graph, software maintenance,
refactoring
1. INTRODUCTION
Considerable research has been dedicated to methods for the detection of similar code fragments in programs. Once located, these fragments, or clones, can be used in many ways. Clones have been used to gain insight into program design, to identify redundant code as candidates for refactoring, and to detect bugs by analyzing clones for consistent usage.
DECKARD [9], CP-Miner [14], CCFinder [10], and CloneDR [3] represent the most mature clone detection techniques. These tools share several common characteristics. Each tool locates syntactic clones, and each has been shown to scale to millions of lines of code. Under empirical evaluation, each tool has been shown to locate comparable numbers of clones.
By operating on token streams and syntax trees, these techniques locate clones that are resilient to minor code modifications, such as the changing of types or constant values. This resilience gives these tools some modicum of semantic awareness: two program fragments may differ in their concrete syntax, but the normalizing effects of the respective clone tools allow the detection of their semantic similarity.
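As a rough illustration of this normalizing effect, the following sketch maps type names, identifiers, and numeric literals to placeholders before comparing token sequences. It is not the procedure used by any of the cited tools: the tokenizer, the `normalize` function, and the keyword and type lists are all hypothetical and deliberately minimal.

```python
import re

# Hypothetical, deliberately tiny tokenizer for C-like snippets (an assumption
# made for illustration; real tools use full lexers and parsers).
TOKEN_RE = re.compile(r"[A-Za-z_]\w*|\d+(?:\.\d+)?|[^\s\w]")
KEYWORDS = {"while", "if", "else", "for", "return"}
TYPES = {"int", "long", "short", "float", "double", "char"}


def normalize(code):
    """Map type names, identifiers, and numeric literals to placeholders."""
    out = []
    for tok in TOKEN_RE.findall(code):
        if tok in KEYWORDS:
            out.append(tok)       # control keywords keep their identity
        elif tok in TYPES:
            out.append("TYPE")
        elif tok[0].isdigit():
            out.append("LIT")
        elif tok[0].isalpha() or tok[0] == "_":
            out.append("ID")
        else:
            out.append(tok)       # operators and punctuation
    return out


a = "int k = 10; while (i < k) { i++; }"
b = "long n = 250; while (x < n) { x++; }"         # different type, names, constant
c = "int k = 10; j = 2 * k; while (i < k) { i++; }"  # one extra statement

same_ab = normalize(a) == normalize(b)
same_ac = normalize(a) == normalize(c)
```

Under these assumptions, `a` and `b` normalize to the same sequence, while the single extra statement in `c` already breaks the match. Collapsing every identifier to `ID` is cruder than what a real tool would do, but it is enough to show why contiguous-syntax matching tolerates renamed types and constants yet not inserted statements.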
The sets of clones located by each of these tools are fundamentally limited by the working definition of a code clone. Each tool is capable of finding clones solely within a program’s contiguous, structured syntax. Certain interesting clones can elude detection: these tools are sensitive to even the simplest structural differences in otherwise semantically similar code. These structural differences can include reordered statements, related statements interleaved with other unrelated statements, or the use of semantically equivalent control structures.
As a motivating example, consider the code snippet in Figure 1. When compared with the listing in Figure 2, the code is similar: both perform the same overall computation, but the latter snippet contains extra statements to time the loop. Current scalable clone detection techniques are unable to detect these interleaved clones.
1  int func(int i, int j) {
2      int k = 10;
4      while (i < k) {
5          i++;
6      }
8      j = 2 * k;
10     printf("i=%d, j=%d\n", i, j);
11     return k;
12 }
Figure 1: Example code listing.
1  int func_timed(int i, int j) {
2      int k = 10;
4      long start = get_time_millis();
5      long finish;
7      while (i < k) {
8          i++;
9      }
11     finish = get_time_millis();
12     printf("loop took %ldms\n", finish - start);
14     j = 2 * k;
16     printf("i=%d, j=%d\n", i, j);
17     return k;
18 }
Figure 2: A similar example code listing.
While detecting true semantic similarity is undecidable in general, some clone detection techniques have attempted to locate clones with a less strict, semantics-preserving definition of similar code. Rather than scanning token sequences or similar subtrees, these techniques have operated on program dependence graphs [6], or PDGs. A PDG is a representation of a procedure in which the nodes represent simple statements and control flow predicates, and edges encode data and control dependencies. PDG-based similarity detection tools have all used some variant of subgraph isomorphism to detect either similar procedures or code fragments [12, 16]. These computations are particularly expensive, and none of these techniques has been shown to scale to even moderately sized code bases.
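To make the cost of graph matching concrete, the sketch below checks whether two small labeled, directed graphs are isomorphic by trying every bijection between their nodes. It is an illustration only: the cited tools use their own variants of subgraph isomorphism, whereas this is a brute-force check of whole-graph isomorphism, and the graphs are hand-written, simplified fragments (not the output of a PDG builder) with hypothetical node names.

```python
from itertools import permutations


def isomorphic(nodes_a, edges_a, nodes_b, edges_b):
    """Brute-force isomorphism test for labeled directed graphs.

    nodes_*: dict mapping node id -> kind label (e.g. "expr", "ctrl-pt")
    edges_*: set of (src, dst, dep) triples, dep in {"data", "control"}
    Every bijection is tried, so the cost grows factorially with graph size.
    """
    if len(nodes_a) != len(nodes_b) or len(edges_a) != len(edges_b):
        return False
    ids_a = list(nodes_a)
    for perm in permutations(nodes_b):
        mapping = dict(zip(ids_a, perm))
        if any(nodes_a[n] != nodes_b[mapping[n]] for n in ids_a):
            continue              # node kinds must be preserved
        mapped = {(mapping[s], mapping[d], dep) for s, d, dep in edges_a}
        if mapped == edges_b:
            return True
    return False


# Simplified loop fragment of Figure 1 (hand-written, assumed edges).
g1_nodes = {"init": "expr", "pred": "ctrl-pt", "inc": "expr"}
g1_edges = {("init", "pred", "data"), ("pred", "inc", "control"),
            ("inc", "pred", "data"), ("inc", "inc", "data")}

# The same fragment as it might appear in Figure 2, with different node ids.
g2_nodes = {"t_inc": "expr", "t_pred": "ctrl-pt", "t_init": "expr"}
g2_edges = {("t_init", "t_pred", "data"), ("t_pred", "t_inc", "control"),
            ("t_inc", "t_pred", "data"), ("t_inc", "t_inc", "data")}

# A fragment with the same node kinds and edge count but a different shape.
g3_nodes = dict(g1_nodes)
g3_edges = {("init", "inc", "data"), ("pred", "inc", "control"),
            ("inc", "pred", "data"), ("inc", "inc", "data")}

match_12 = isomorphic(g1_nodes, g1_edges, g2_nodes, g2_edges)
match_13 = isomorphic(g1_nodes, g1_edges, g3_nodes, g3_edges)
```

With three nodes there are only six bijections to try, but the count is n! for n nodes, which gives some intuition for why approaches built on graph matching struggle on large code bases.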
In this paper, we introduce an extended definition of code clones, based on PDG similarity, that captures more semantic information than previous approaches. We then provide a scalable, approximate algorithm for detecting these clones. We reduce the difficult graph similarity problem to a simpler tree similarity problem by creating a mapping between PDG subgraphs and their related structured syntax. Specifically, we make the following technical contributions:
1. We extend the definition of a code clone to include semantically (but not necessarily syntactically) related code fragments. Our definition is a generalization of previous syntactic clone definitions, and it thus defines a superset of previously defined clones.
2. We introduce an approximate algorithm for detecting these clones that scales to millions of lines of code. Our algorithm is based on a reduction of deliberately selected PDG subgraphs to abstract syntax tree forests. We then utilize an existing, tree-based detection technique [9] to locate clones.
3. We implement a practical tool based on our algorithm and perform an extensive empirical evaluation. Our tool is capable of scanning large, real-world C and C++ projects. Compared with previous approaches, our tool locates significantly more clones, which are often more semantically interesting than simple copied and pasted code fragments.
The rest of this paper is structured as follows. We begin with a discussion of background information on components of our analysis (Section 2). The body of our work continues with the presentation of our definitions and algorithm (Section 3). We then discuss our implementation (Section 4) and present the results of our empirical evaluation (Section 5). Finally, we discuss related work (Section 6) and conclude with ideas for future work (Section 7).
Figure 3: The PDG for Figure 1. (Nodes: entry func(); formal-in int i; formal-in int j; body func(); decl int k; expr k = 10; ctrl-pt i < k; expr i++; expr j = 2 * k; call-site printf() with actual-in nodes i, j, and "i=%d, j=%d"; return k; formal-out func(); exit. Key: statement node, control point node, data dependency, control dependency.)
2. BACKGROUND
Our algorithm augments an existing clone detection technique, DECKARD [9], with semantic information derived from program dependence graphs (PDGs). This section provides the necessary background on both program dependence graphs and DECKARD’s vector-based clone detection.
2.1 Program Dependence Graphs
A program dependence graph [6] (PDG) is a static representation of the flow of data through a procedure. It is commonly used to implement program slicing [20]. The nodes of a PDG consist of program points constructed from the source code: declarations, simple statements, expressions, and control points. A control point represents a point at which a program branches, loops, or enters or exits a procedure and is labeled by its associated predicate.
A PDG models the flow of data through a procedure. In effect, the PDG abstracts away many arbitrary syntactic decisions a programmer made while constructing a function. For example, any possible arbitrary interleaving of unrelated statements within a procedure yields precisely the same PDG.
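The interleaving property can be checked on a toy model. The sketch below computes def-use edges for straight-line code only and compares two orderings of the same statements. It is a simplification made for illustration: it ignores control dependencies and any ordering imposed by I/O, and the statement labels and the `data_dependencies` helper are hypothetical.

```python
def data_dependencies(stmts):
    """Def-use edges for straight-line code.

    stmts: list of (label, defined_vars, used_vars) in program order.
    Returns a set of (defining_label, using_label) pairs.
    """
    last_def = {}     # variable -> label of its most recent definition
    edges = set()
    for label, defs, uses in stmts:
        for var in uses:
            if var in last_def:
                edges.add((last_def[var], label))
        for var in defs:
            last_def[var] = label
    return edges


# Statements of the computation itself (loosely modeled on Figure 1).
core = [
    ("k = 10", {"k"}, set()),
    ("j = 2 * k", {"j"}, {"k"}),
    ("print(i, j)", set(), {"i", "j"}),
]
# Unrelated timing statements (loosely modeled on Figure 2).
timing = [
    ("start = now()", {"start"}, set()),
    ("finish = now()", {"finish"}, set()),
    ("print(finish - start)", set(), {"finish", "start"}),
]

sequential = core + timing
interleaved = [core[0], timing[0], core[1], timing[1], timing[2], core[2]]

edges_sequential = data_dependencies(sequential)
edges_interleaved = data_dependencies(interleaved)
core_edges = data_dependencies(core)
```

Both orderings produce the same edge set, and the edges of the core computation appear unchanged inside it, even though the two statement sequences are no longer contiguous matches of each other.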
The edges of a PDG encode the data and control dependencies between program points. Given two program points $p_1$ and $p_2$, there exists a directed data dependency edge from $p_1$ to $p_2$ if and only if the execution of $p_2$ depends on data calculated directly by $p_1$. For example, consider the statements on lines 2 and 8 of the listing in Figure 1. The second statement computes its result from k, which is initialized in the first. This dependency is illustrated by a directed edge between the two nodes in Figure 3.
Note that the node corresponding to the formal parameter j does
not have any outgoing edges. This accurately reflects the fact that j
is redefined without ever being used at line 8 in the listing.
The incrementing of i on line 5 also presents an interesting case. Because an increment constitutes both a use and a definition, the node in the PDG corresponding to i++ has both a self data dependency loop and outgoing data dependency edges.
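These three observations can be reproduced with a standard reaching-definitions analysis over a hand-built control flow graph of Figure 1. The sketch below is an assumption-laden illustration, not the construction used in the paper: it computes data dependency edges only, treats the printf call as a single node, and uses hypothetical node ids (the line numbers of Figure 1 plus `i_in` and `j_in` for the formal parameters).

```python
def data_dependence_edges(nodes, succ):
    """nodes: id -> (defs, uses); succ: id -> list of successor ids."""
    preds = {n: [] for n in nodes}
    for n, targets in succ.items():
        for t in targets:
            preds[t].append(n)

    out = {n: set() for n in nodes}       # (var, def_node) pairs live after n
    reach_in = {n: set() for n in nodes}  # (var, def_node) pairs reaching n
    changed = True
    while changed:                         # iterate to a fixed point
        changed = False
        for n, (defs, _uses) in nodes.items():
            reach_in[n] = set().union(*(out[p] for p in preds[n]))
            new_out = {(v, d) for (v, d) in reach_in[n] if v not in defs}
            new_out |= {(v, n) for v in defs}
            if new_out != out[n]:
                out[n] = new_out
                changed = True

    # A definition of v at d that reaches a use of v at n gives edge d -> n.
    return {(d, n) for n, (_defs, uses) in nodes.items()
            for (v, d) in reach_in[n] if v in uses}


# Hand-built model of Figure 1 (assumed, simplified).
nodes = {
    "i_in": ({"i"}, set()),       # formal parameter i
    "j_in": ({"j"}, set()),       # formal parameter j
    2: ({"k"}, set()),            # int k = 10;
    4: (set(), {"i", "k"}),       # while (i < k)
    5: ({"i"}, {"i"}),            # i++;
    8: ({"j"}, {"k"}),            # j = 2 * k;
    10: (set(), {"i", "j"}),      # printf(..., i, j);
    11: (set(), {"k"}),           # return k;
}
succ = {"i_in": ["j_in"], "j_in": [2], 2: [4], 4: [5, 8],
        5: [4], 8: [10], 10: [11], 11: []}

edges = data_dependence_edges(nodes, succ)
```

In this model the edge from line 2 to line 8 appears, `j_in` has no outgoing edges because j is redefined at line 8 before any use, and line 5 has both a self loop and outgoing edges to the loop predicate and the printf call.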