[cfe-dev] CopyPaste detection clang static analyzer

Thu Mar 26 07:13:39 PDT 2015

My comments went in google-melange. Could you please post the newest 
version here once again when you finish it?
Vassil
On 20/03/15 15:52, Jiayi Ye wrote:
> Hi all,
>
> I am a student in Peking university, China. And I am pursing my master's
> degree. I have some experience with LLVM and static analysis such as wrote a
> LLVM pass to detect use-after-free bugs. I'd love to work in the project
> copy-paste detection. Please give me some advise about my initial proposal.
> Thank you so much.
>
> *Introduction*
> We begin with a basic introduction to clone detection. There are three types
> of clones.
> Type 1 (Exact clones): Identical code fragments except small variations in
> white spaces, layout and comments.
> Type 2 (Renamed clones): Syntactically identical code fragments except for
> variations in literals, identifiers, types, white spaces, layout and
> comments .
> Type 3 (Modified clones): These are copied fragments with additional
> modifications such as changed, added or removed statements, in addition to
> variations in literals, identifiers, types, white spaces, layout and
> comments.
>
> And techniques of code clone detection can be classified as follows.
>
> These techniques have pros and cons.  Comparison between them is described
> in survey papers http://ijarcsms.com/docs/paper/volume3/issue1/V3I1-0012.pdf
> and http://ijcsn.org/IJCSN-2014/3-2/Review-of-Code-Clone-Detection.pdf.
>
> *Implementation Plan*
> I plan to develop copy-paste detection in three steps.
> 1.  Detection of Type 1
> I’ll start off with Nick’s patch which have analyzed conditional operators
> and if/elseif chains. I’ll improve Nick’s patch to collect all the
> expressions through something like a && b &&c && a. This part could be
> implemented as compiler diagnostics because it is efficient and with very
> low false positive rate.
>
> 2. Detection of Type 2
> I plan to adopt AST-based method described in
> http://www.informatik.uni-bremen.de/st/IWSC/bulychev.pdf . The abstract
> syntax tree of program is first linearized, i.e., all sequences of
> statements and definitions are presented in the abstract tree as sibling
> subtrees. Then three phases of the algorithm are preformed. Three phases are
> as follows.
>
> (1) Partition similar statements into clusters using anti-unification. After
> the first phase each statement is marked with its cluster ID – thus, two
> statements with the same cluster ID are considered similar in this
> preliminary view. This phase uses a two-pass clustering algorithm. During
> the first pass all statements are gone over and each new statement is
> compared with the anti-unifiers of all existing clusters. During the second
> pass all the statements are traversed again and for each statement we search
> for the most similar pattern from the set produced in the previous pass
> (again using the anti-unification distance). And hashing (using the notion
> of d-cup) is used to speed up the clustering process.
> (2) Find identical sequences of cluster IDs. All pairs of sequences of
> statements which are identically labeled (have the same labels on the same
> position) are searched using a suffix tree approach.
> (3) Refine by examining identified code sequences for structural similarity.
> In this phase, every pair of candidate sequences is checked for overall
> similarity at the statement level, again using anti-unification.
>
> This part could implemented as bug checker.
>
> 3. Detection of Type 3
> AST-based method can only detected Type 3 with low accuracy, because the
> added or deleted instructions strictly change the structure of AST. So we
> should introduce semantic analysis to achieve high accuracy. But algorithms
> based on semantic analysis have high computational complexity, which makes
> them unusable for large software systems analysis. Authors of the paper
> http://mpcs.sci.am/filesimages/volumes/volume_42/05.pdf (Related talk
> http://llvm.org/devmtg/2015-02/slides/sargsyan-code-clone.pdf) proposed a
> method which computes metrics for nodes of PDG and compares them. For every
> node of program dependence graph a characteristic vector is constructed,
> which contains information about neighbors. These characteristic vectors are
> represented as 64-bit integer numbers, which allows determining similarity
> between two nodes in amortized constant time. The paper also describes new
> approach for dependency graphs generation, which allows building them much
> faster than any of the existing methods.
> IMO we can imitate the method to implement a bug checker of type 3. But we
> don’t focus on IR level.
>
> In the mean time, I will develop different test cases of different clone
> types to test precision and recall of detection. Then a thorough test suite
> will be developed.
> Finally, I will commit detailed documentations and prepare a final poster of
> the work.
>
> *Timeline*
> TODO
>
> Best regards,
> Jiayi Ye
>
>
>
> --
> View this message in context: http://clang-developers.42468.n3.nabble.com/CopyPaste-detection-clang-static-analyzer-tp4037634p4044606.html
> Sent from the Clang Developers mailing list archive at Nabble.com.
>
> _______________________________________________
> cfe-dev mailing list
> cfe-dev at cs.uiuc.edu
> http://lists.cs.uiuc.edu/mailman/listinfo/cfe-dev