[cfe-dev] CopyPaste detection clang static analyzer

Fri Mar 20 07:52:42 PDT 2015

Hi all,

I am a student in Peking university, China. And I am pursing my master's
degree. I have some experience with LLVM and static analysis such as wrote a
LLVM pass to detect use-after-free bugs. I'd love to work in the project
copy-paste detection. Please give me some advise about my initial proposal.
Thank you so much.

*Introduction*
We begin with a basic introduction to clone detection. There are three types
of clones.
Type 1 (Exact clones): Identical code fragments except small variations in
white spaces, layout and comments.
Type 2 (Renamed clones): Syntactically identical code fragments except for
variations in literals, identifiers, types, white spaces, layout and
comments .
Type 3 (Modified clones): These are copied fragments with additional
modifications such as changed, added or removed statements, in addition to
variations in literals, identifiers, types, white spaces, layout and
comments.

And techniques of code clone detection can be classified as follows.

These techniques have pros and cons.  Comparison between them is described
in survey papers http://ijarcsms.com/docs/paper/volume3/issue1/V3I1-0012.pdf
and http://ijcsn.org/IJCSN-2014/3-2/Review-of-Code-Clone-Detection.pdf.

*Implementation Plan*
I plan to develop copy-paste detection in three steps.
1.  Detection of Type 1
I’ll start off with Nick’s patch which have analyzed conditional operators
and if/elseif chains. I’ll improve Nick’s patch to collect all the
expressions through something like a && b &&c && a. This part could be
implemented as compiler diagnostics because it is efficient and with very
low false positive rate.

2. Detection of Type 2
I plan to adopt AST-based method described in
http://www.informatik.uni-bremen.de/st/IWSC/bulychev.pdf . The abstract
syntax tree of program is first linearized, i.e., all sequences of
statements and definitions are presented in the abstract tree as sibling
subtrees. Then three phases of the algorithm are preformed. Three phases are
as follows.

(1) Partition similar statements into clusters using anti-unification. After
the first phase each statement is marked with its cluster ID – thus, two
statements with the same cluster ID are considered similar in this
preliminary view. This phase uses a two-pass clustering algorithm. During
the first pass all statements are gone over and each new statement is
compared with the anti-unifiers of all existing clusters. During the second
pass all the statements are traversed again and for each statement we search
for the most similar pattern from the set produced in the previous pass
(again using the anti-unification distance). And hashing (using the notion
of d-cup) is used to speed up the clustering process.
(2) Find identical sequences of cluster IDs. All pairs of sequences of
statements which are identically labeled (have the same labels on the same
position) are searched using a suffix tree approach.
(3) Refine by examining identified code sequences for structural similarity.
In this phase, every pair of candidate sequences is checked for overall
similarity at the statement level, again using anti-unification.

This part could implemented as bug checker.

3. Detection of Type 3
AST-based method can only detected Type 3 with low accuracy, because the
added or deleted instructions strictly change the structure of AST. So we
should introduce semantic analysis to achieve high accuracy. But algorithms
based on semantic analysis have high computational complexity, which makes
them unusable for large software systems analysis. Authors of the paper
http://mpcs.sci.am/filesimages/volumes/volume_42/05.pdf (Related talk
http://llvm.org/devmtg/2015-02/slides/sargsyan-code-clone.pdf) proposed a
method which computes metrics for nodes of PDG and compares them. For every
node of program dependence graph a characteristic vector is constructed,
which contains information about neighbors. These characteristic vectors are
represented as 64-bit integer numbers, which allows determining similarity
between two nodes in amortized constant time. The paper also describes new
approach for dependency graphs generation, which allows building them much
faster than any of the existing methods. 
IMO we can imitate the method to implement a bug checker of type 3. But we
don’t focus on IR level. 

In the mean time, I will develop different test cases of different clone
types to test precision and recall of detection. Then a thorough test suite
will be developed.
Finally, I will commit detailed documentations and prepare a final poster of
the work. 

*Timeline*
TODO

Best regards,
Jiayi Ye

--
View this message in context: http://clang-developers.42468.n3.nabble.com/CopyPaste-detection-clang-static-analyzer-tp4037634p4044606.html
Sent from the Clang Developers mailing list archive at Nabble.com.