[cfe-dev] CopyPaste detection clang static analyzer

Fri Mar 27 08:07:00 PDT 2015

The modified version is as follows. Thanks.

*Introduction*

We begin with a basic introduction to clone detection. There are four types
of clones.

Type 1 (Exact clones): Identical code fragments except small variations in
white spaces, layout and comments.
Type 2 (Renamed clones): Syntactically identical code fragments except for
variations in literals, identifiers, types, white spaces, layout and
comments .
Type 3 (Modified clones): These are copied fragments with additional
modifications such as changed, added or removed statements, in addition to
variations in literals, identifiers, types, white spaces, layout and
comments.
Type 4 (Semantic clones): Two or more code fragments that perform the same
computation but are implemented by different syntactic variants.

Test Cases for Type 1, 2, 3, 4 are as follows.

1.Type 1

// expect warning because the code in “if” and the code in “else” is
identical

int test1(int expr) {
    if (expr) {
       int a = 3, b = 4;
       int c = a + b; // add a and b
       return c;
    }
    else {
    int a=3,b=4;
    /* add ‘a’ and ‘b’ */
    int c=a+b;
    return c;
    }
}

// expect warning because expressions are same on either side of condition
int test2(int a, int b, int expr){
    return expr ? a  +  b : a+b;
}

// expect warning because “code1()” repeats in conditional expression
extern bool code1();
extern bool code2();
extern bool code3();
int test3() {
    if ( code1() && code2() && code3() && code1() ) {
       return 1;
    }
    return 0;
}

2. Type 2
struct S_STRUCT{
    int a;
    int b;
}

int test4(S_STRUCT s1, S_STRUCT s2) {
    int sa = (s1.a + s2.a) * (s1.a - s2.a) + s1.a * s2.a;
    int sb = (s1.b + s2.b) * (s1.b - s2.b) + s1.b * s2.a; // except
diagnostics “Do you mean s2.b here”
    return sb - sa;
}

int func ( int a, int b, int c, int d){
    return a + b + c + d;
}

int test5 (S_STRUCT s1, S_STRUCT s2) {
    int a = func (s1.a, s1.b, s1.a * s1.b, s1.a / s1.b);
    int b = func (s2.a, s2.b, s2.a * s2.b, s2.a / s1.b);   // except
diagnostics “Do you mean s2.b here”
    return a + b;

}

3. Type3
#define THRESHOLD 100
int test6 (int a, int b) {
    int sum = 0;
    if (a > THRESHOLD) {
       int x = 3, y = 4;
       sum += a * (x + y);
    }
    if (b >= THRESHOLD) { // except diagnostics “Do you mean > here”
       int x = 3, y = 4;
       sum += b * (x + y);
    }
    return sum;
}

#define THRESHOLD 100
int test6 (int &a, int &b) {
    int sum = 0;
    if (a > THRESHOLD) {
       int x = 3, y = 4;
       sum += a * (x + y);
       a = 0;
    }
    if (b > THRESHOLD) {
       int x = 3, y = 4;
       sum += b * (x + y);
       // except diagnostics “Do you miss b = 0 here”
    }
    return sum;
}

4. Type 4
void test7(){
    func1();
    func2(); // func1() and fun2() are semantically same.
}

int func1(){
    int sum = 0;
    for (int i = 0; i < 10; ++i) {
       sum += i;
    }
    return sum;
}

int func2(){
    int sum = 0;
    int i = 0;
    while(i < 10){
       sum += i;
       ++i;
    }
    return sum;
}
And techniques of code clone detection can be classified as Text-based,
Token-based, Metrics-based, AST-based and PDG-based.

These techniques have pros and cons.  Comparison between them is described
in survey papers
http://research.cs.queensu.ca/TechReports/Reports/2007-541.pdf,
http://ijarcsms.com/docs/paper/volume3/issue1/V3I1-0012.pdf and
http://ijcsn.org/IJCSN-2014/3-2/Review-of-Code-Clone-Detection.pdf.

*Implementation Plan*

I plan to adopt AST-based method described in
http://www.informatik.uni-bremen.de/st/IWSC/bulychev.pdf to detect Type 1,
Type 2 and Type 3. Firstly, the abstract syntax tree of program is first
linearized, i.e., all sequences of statements and definitions are presented
in the abstract tree as sibling subtrees. Then three phases of the
algorithm are performed. Three phases are as follows.

(1) Partition similar statements into clusters using anti-unification.
After the first phase each statement is marked with its cluster ID - thus,
two statements with the same cluster ID are considered similar in this
preliminary view. This phase uses a two-pass clustering algorithm. During
the first pass all statements are gone over and each new statement is
compared with the anti-unifiers of all existing clusters. During the second
pass all the statements are traversed again and for each statement we
search for the most similar pattern from the set produced in the previous
pass (again using the anti-unification distance). And hashing (using the
notion of d-cup) is used to speed up the clustering process.

(2) Find identical sequences of cluster IDs. All pairs of sequences of
statements which are identically labeled (have the same labels on the same
position) are searched using a suffix tree approach.

(3) Refine by examining identified code sequences for structural
similarity. In this phase, every pair of candidate sequences is checked for
overall similarity at the statement level, again using anti-unification.

As regards to Type 3, AST-based method may detected Type 3 with low
accuracy, because the added or deleted instructions strictly change the
structure of AST. So we should introduce semantic analysis to achieve high
accuracy. But algorithms based on semantic analysis have high computational
complexity, which makes them unusable for large software systems analysis.

About Type 3, I still have not figured out a better way to deal with. So at
this time I keep to adopt the method described above. And later I’ll try to
make some change to improve accuracy.

Type 4 is more complex and we don’t focus on it now.

All in all, I plan to develop copy-paste detection in three steps.

1. Compiler Diagnostics Development
I’ll start off with Nick’s patch which have analyzed conditional operators
and if/elseif chains. I’ll improve Nick’s patch to collect all the
expressions through something like a && b &&c && a. In addition, I’ll
implement detection of clone code blocks (consist of sequence of same
expressions).

The detection of Type 1 could be implemented as compiler diagnostics
because it is efficient and with very low false positive rate.

2. Bug Checker Development
I plan to implement the detection of Type 2 and Type 3 as bug checker in
the static analyzer because it introduces guess of programmers’ intention
and may lead to some false positives.

3. Test Suite Development
In the mean time, I will develop different test cases of different clone
types to test precision and recall of detection. Then a thorough test suite
will be developed.

Finally, I will commit detailed documentations and prepare a final poster
of the work.

*Timeline*

*Community bonding period:*
Thanks for initial work done by Nick which have analyzed conditional
operators and if/elseif chains. I’ll start off with the patch and study
documents about compiler diagnostics as well as Clang Static Analyzer
development. In addition, since now the static analyzer has alpha support
for detecting identical expressions using alpha.core.IdenticalExpr, I’ll
check the difference between Nick’s patch and the one supporting
alpha.core.IdenticalExpr. And I’ll try to figure out how to improve
accuracy of detection of Type 3.

*Week 1 - Compiler Diagnostics Development (1)*

Improve Nick’s patch to collect all the expressions through something like
a && b && c && a.

*Week 2 - Compiler Diagnostics Development (2)*
Implement detection of clone code blocks (consist of sequence of same
expressions) using method proposed above.
Implement the first phase -- partitioning similar statements into clusters.
Perform two-pass clustering algorithm to mark each statement with a cluster
ID.

*Week 3 - Compiler Diagnostics Development (3)*
Implement the second phase -- Find pairs of identical cluster sequences.
Use a suffix tree approach to search pairs.

*Week 4 - Compiler Diagnostics Development (4)*
Implement the third phase -- Refine by examining identified code sequences
for structural similarity.
Use anti-unification to check every pair of for overall similarity.
We select pairs that are exactly the same.

*Week 5 - Test Suite Development (Type 1)*
Develop test cases of Type 1
Fix bugs and polish code

*Week 6 - Bug Checker Development (1)*
Implement detection of Type 2, Type 3 as bug checker using the method
proposed above.

*Week 7 - Bug Checker Development (2)*
Implement detection of Type 2, Type 3 as bug checker using the method
proposed above.
Compare the selected clone pairs and make warning such as  “Do you mean
global2 here”

*Week 8 - Test Suite Development (Type 2)*
Develop test cases of Type 2
Fix bugs and polish code

*Week 9 - Bug Checker Development (3)*
Implement improvement of detection of Type 3

*Week 10 -  Test Suite Development (Type 3)*
Develop test cases of Type 3
Fix bugs and polish code

*Week 11 -  Improvement*
Make improvements.
Review all code.
Write documentations.

*Week 12 - Documentations*
Write documentations and prepare a final poster of the work.

On Thu, Mar 26, 2015 at 10:13 PM, Vassil Vassilev <vvasilev at cern.ch> wrote:

> My comments went in google-melange. Could you please post the newest
> version here once again when you finish it?
> Vassil
>
> On 20/03/15 15:52, Jiayi Ye wrote:
>
>> Hi all,
>>
>> I am a student in Peking university, China. And I am pursing my master's
>> degree. I have some experience with LLVM and static analysis such as
>> wrote a
>> LLVM pass to detect use-after-free bugs. I'd love to work in the project
>> copy-paste detection. Please give me some advise about my initial
>> proposal.
>> Thank you so much.
>>
>> *Introduction*
>> We begin with a basic introduction to clone detection. There are three
>> types
>> of clones.
>> Type 1 (Exact clones): Identical code fragments except small variations in
>> white spaces, layout and comments.
>> Type 2 (Renamed clones): Syntactically identical code fragments except for
>> variations in literals, identifiers, types, white spaces, layout and
>> comments .
>> Type 3 (Modified clones): These are copied fragments with additional
>> modifications such as changed, added or removed statements, in addition to
>> variations in literals, identifiers, types, white spaces, layout and
>> comments.
>>
>> And techniques of code clone detection can be classified as follows.
>>
>> These techniques have pros and cons.  Comparison between them is described
>> in survey papers http://ijarcsms.com/docs/paper/volume3/issue1/V3I1-
>> 0012.pdf
>> and http://ijcsn.org/IJCSN-2014/3-2/Review-of-Code-Clone-Detection.pdf.
>>
>> *Implementation Plan*
>> I plan to develop copy-paste detection in three steps.
>> 1.  Detection of Type 1
>> I’ll start off with Nick’s patch which have analyzed conditional operators
>> and if/elseif chains. I’ll improve Nick’s patch to collect all the
>> expressions through something like a && b &&c && a. This part could be
>> implemented as compiler diagnostics because it is efficient and with very
>> low false positive rate.
>>
>> 2. Detection of Type 2
>> I plan to adopt AST-based method described in
>> http://www.informatik.uni-bremen.de/st/IWSC/bulychev.pdf . The abstract
>> syntax tree of program is first linearized, i.e., all sequences of
>> statements and definitions are presented in the abstract tree as sibling
>> subtrees. Then three phases of the algorithm are preformed. Three phases
>> are
>> as follows.
>>
>> (1) Partition similar statements into clusters using anti-unification.
>> After
>> the first phase each statement is marked with its cluster ID – thus, two
>> statements with the same cluster ID are considered similar in this
>> preliminary view. This phase uses a two-pass clustering algorithm. During
>> the first pass all statements are gone over and each new statement is
>> compared with the anti-unifiers of all existing clusters. During the
>> second
>> pass all the statements are traversed again and for each statement we
>> search
>> for the most similar pattern from the set produced in the previous pass
>> (again using the anti-unification distance). And hashing (using the notion
>> of d-cup) is used to speed up the clustering process.
>> (2) Find identical sequences of cluster IDs. All pairs of sequences of
>> statements which are identically labeled (have the same labels on the same
>> position) are searched using a suffix tree approach.
>> (3) Refine by examining identified code sequences for structural
>> similarity.
>> In this phase, every pair of candidate sequences is checked for overall
>> similarity at the statement level, again using anti-unification.
>>
>> This part could implemented as bug checker.
>>
>> 3. Detection of Type 3
>> AST-based method can only detected Type 3 with low accuracy, because the
>> added or deleted instructions strictly change the structure of AST. So we
>> should introduce semantic analysis to achieve high accuracy. But
>> algorithms
>> based on semantic analysis have high computational complexity, which makes
>> them unusable for large software systems analysis. Authors of the paper
>> http://mpcs.sci.am/filesimages/volumes/volume_42/05.pdf (Related talk
>> http://llvm.org/devmtg/2015-02/slides/sargsyan-code-clone.pdf) proposed a
>> method which computes metrics for nodes of PDG and compares them. For
>> every
>> node of program dependence graph a characteristic vector is constructed,
>> which contains information about neighbors. These characteristic vectors
>> are
>> represented as 64-bit integer numbers, which allows determining similarity
>> between two nodes in amortized constant time. The paper also describes new
>> approach for dependency graphs generation, which allows building them much
>> faster than any of the existing methods.
>> IMO we can imitate the method to implement a bug checker of type 3. But we
>> don’t focus on IR level.
>>
>> In the mean time, I will develop different test cases of different clone
>> types to test precision and recall of detection. Then a thorough test
>> suite
>> will be developed.
>> Finally, I will commit detailed documentations and prepare a final poster
>> of
>> the work.
>>
>> *Timeline*
>> TODO
>>
>> Best regards,
>> Jiayi Ye
>>
>>
>>
>> --
>> View this message in context: http://clang-developers.42468.
>> n3.nabble.com/CopyPaste-detection-clang-static-
>> analyzer-tp4037634p4044606.html
>> Sent from the Clang Developers mailing list archive at Nabble.com.
>>
>> _______________________________________________
>> cfe-dev mailing list
>> cfe-dev at cs.uiuc.edu
>> http://lists.cs.uiuc.edu/mailman/listinfo/cfe-dev
>>
>
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.llvm.org/pipermail/cfe-dev/attachments/20150327/0f2c381e/attachment.html>