<div dir="ltr"><div>The modified version is as follows. Thanks.</div><div><br></div><div><b>Introduction</b></div><div><br></div><div>We begin with a basic introduction to clone detection. There are four types of clones.</div><div><br></div><div>Type 1 (Exact clones): Identical code fragments except small variations in white spaces, layout and comments.</div><div>Type 2 (Renamed clones): Syntactically identical code fragments except for variations in literals, identifiers, types, white spaces, layout and comments .</div><div>Type 3 (Modified clones): These are copied fragments with additional modifications such as changed, added or removed statements, in addition to variations in literals, identifiers, types, white spaces, layout and comments.</div><div>Type 4 (Semantic clones): Two or more code fragments that perform the same computation but are implemented by different syntactic variants.</div><div><br></div><div>Test Cases for Type 1, 2, 3, 4 are as follows.</div><div><br></div><div>1.Type 1</div><div><br></div><div>// expect warning because the code in “if” and the code in “else” is identical</div><div><br></div><div>int test1(int expr) {</div><div> if (expr) {</div><div> int a = 3, b = 4;</div><div> int c = a + b; // add a and b</div><div> return c;</div><div> }</div><div> else {</div><div> int a=3,b=4;</div><div> /* add ‘a’ and ‘b’ */</div><div> int c=a+b;</div><div> return c;</div><div> }</div><div>} </div><div><br></div><div>// expect warning because expressions are same on either side of condition</div><div>int test2(int a, int b, int expr){</div><div> return expr ? a + b : a+b;</div><div>}</div><div><br></div><div>// expect warning because “code1()” repeats in conditional expression</div><div>extern bool code1();</div><div>extern bool code2();</div><div>extern bool code3();</div><div>int test3() {</div><div> if ( code1() && code2() && code3() && code1() ) {</div><div> return 1;</div><div> }</div><div> return 0;</div><div>}</div><div><br></div><div>2. Type 2</div><div>struct S_STRUCT{</div><div> int a;</div><div> int b;</div><div>}</div><div><br></div><div>int test4(S_STRUCT s1, S_STRUCT s2) {</div><div> int sa = (s1.a + s2.a) * (s1.a - s2.a) + s1.a * s2.a;</div><div> int sb = (s1.b + s2.b) * (s1.b - s2.b) + s1.b * s2.a; // except diagnostics “Do you mean s2.b here”</div><div> return sb - sa;</div><div>} </div><div><br></div><div>int func ( int a, int b, int c, int d){</div><div> return a + b + c + d;</div><div>}</div><div><br></div><div>int test5 (S_STRUCT s1, S_STRUCT s2) {</div><div> int a = func (s1.a, s1.b, s1.a * s1.b, s1.a / s1.b);</div><div> int b = func (s2.a, s2.b, s2.a * s2.b, s2.a / s1.b); // except diagnostics “Do you mean s2.b here”</div><div> return a + b;</div><div><br></div><div>}</div><div><br></div><div>3. Type3</div><div>#define THRESHOLD 100</div><div>int test6 (int a, int b) {</div><div> int sum = 0;</div><div> if (a > THRESHOLD) {</div><div> int x = 3, y = 4;</div><div> sum += a * (x + y);</div><div> }</div><div> if (b >= THRESHOLD) { // except diagnostics “Do you mean > here”</div><div> int x = 3, y = 4;</div><div> sum += b * (x + y);</div><div> } </div><div> return sum;</div><div>} </div><div><br></div><div>#define THRESHOLD 100</div><div>int test6 (int &a, int &b) {</div><div> int sum = 0;</div><div> if (a > THRESHOLD) {</div><div> int x = 3, y = 4;</div><div> sum += a * (x + y);</div><div> a = 0;</div><div> }</div><div> if (b > THRESHOLD) {</div><div> int x = 3, y = 4;</div><div> sum += b * (x + y);</div><div> // except diagnostics “Do you miss b = 0 here”</div><div> }</div><div> return sum;</div><div>}</div><div><br></div><div>4. Type 4</div><div>void test7(){</div><div> func1();</div><div> func2(); // func1() and fun2() are semantically same.</div><div>}</div><div><br></div><div>int func1(){</div><div> int sum = 0;</div><div> for (int i = 0; i < 10; ++i) {</div><div> sum += i;</div><div> }</div><div> return sum;</div><div>} </div><div><br></div><div>int func2(){</div><div> int sum = 0;</div><div> int i = 0;</div><div> while(i < 10){</div><div> sum += i;</div><div> ++i;</div><div> }</div><div> return sum;</div><div>}</div><div>And techniques of code clone detection can be classified as Text-based, Token-based, Metrics-based, AST-based and PDG-based.</div><div><br></div><div>These techniques have pros and cons. Comparison between them is described in survey papers <a href="http://research.cs.queensu.ca/TechReports/Reports/2007-541.pdf">http://research.cs.queensu.ca/TechReports/Reports/2007-541.pdf</a>, <a href="http://ijarcsms.com/docs/paper/volume3/issue1/V3I1-0012.pdf">http://ijarcsms.com/docs/paper/volume3/issue1/V3I1-0012.pdf</a> and <a href="http://ijcsn.org/IJCSN-2014/3-2/Review-of-Code-Clone-Detection.pdf">http://ijcsn.org/IJCSN-2014/3-2/Review-of-Code-Clone-Detection.pdf</a>.</div><div><br></div><div><b>Implementation Plan</b></div><div><br></div><div>I plan to adopt AST-based method described in <a href="http://www.informatik.uni-bremen.de/st/IWSC/bulychev.pdf">http://www.informatik.uni-bremen.de/st/IWSC/bulychev.pdf</a> to detect Type 1, Type 2 and Type 3. Firstly, the abstract syntax tree of program is first linearized, i.e., all sequences of statements and definitions are presented in the abstract tree as sibling subtrees. Then three phases of the algorithm are performed. Three phases are as follows.</div><div><br></div><div>(1) Partition similar statements into clusters using anti-unification. After the first phase each statement is marked with its cluster ID - thus, two statements with the same cluster ID are considered similar in this preliminary view. This phase uses a two-pass clustering algorithm. During the first pass all statements are gone over and each new statement is compared with the anti-unifiers of all existing clusters. During the second pass all the statements are traversed again and for each statement we search for the most similar pattern from the set produced in the previous pass (again using the anti-unification distance). And hashing (using the notion of d-cup) is used to speed up the clustering process.</div><div><br></div><div>(2) Find identical sequences of cluster IDs. All pairs of sequences of statements which are identically labeled (have the same labels on the same position) are searched using a suffix tree approach. </div><div><br></div><div>(3) Refine by examining identified code sequences for structural similarity. In this phase, every pair of candidate sequences is checked for overall similarity at the statement level, again using anti-unification.</div><div><br></div><div>As regards to Type 3, AST-based method may detected Type 3 with low accuracy, because the added or deleted instructions strictly change the structure of AST. So we should introduce semantic analysis to achieve high accuracy. But algorithms based on semantic analysis have high computational complexity, which makes them unusable for large software systems analysis.</div><div><br></div><div>About Type 3, I still have not figured out a better way to deal with. So at this time I keep to adopt the method described above. And later I’ll try to make some change to improve accuracy.</div><div><br></div><div>Type 4 is more complex and we don’t focus on it now.</div><div><br></div><div>All in all, I plan to develop copy-paste detection in three steps.</div><div><br></div><div>1. Compiler Diagnostics Development</div><div>I’ll start off with Nick’s patch which have analyzed conditional operators and if/elseif chains. I’ll improve Nick’s patch to collect all the expressions through something like a && b &&c && a. In addition, I’ll implement detection of clone code blocks (consist of sequence of same expressions).</div><div><br></div><div>The detection of Type 1 could be implemented as compiler diagnostics because it is efficient and with very low false positive rate.</div><div><br></div><div>2. Bug Checker Development</div><div>I plan to implement the detection of Type 2 and Type 3 as bug checker in the static analyzer because it introduces guess of programmers’ intention and may lead to some false positives.</div><div><br></div><div>3. Test Suite Development</div><div>In the mean time, I will develop different test cases of different clone types to test precision and recall of detection. Then a thorough test suite will be developed.</div><div><br></div><div>Finally, I will commit detailed documentations and prepare a final poster of the work.</div><div><br></div><div><b>Timeline</b></div><div><i><br></i></div><div><i>Community bonding period:</i></div><div>Thanks for initial work done by Nick which have analyzed conditional operators and if/elseif chains. I’ll start off with the patch and study documents about compiler diagnostics as well as Clang Static Analyzer development. In addition, since now the static analyzer has alpha support for detecting identical expressions using alpha.core.IdenticalExpr, I’ll check the difference between Nick’s patch and the one supporting alpha.core.IdenticalExpr. And I’ll try to figure out how to improve accuracy of detection of Type 3.</div><div><br></div><div><i>Week 1 - Compiler Diagnostics Development (1)</i></div><div><br></div><div>Improve Nick’s patch to collect all the expressions through something like a && b && c && a.</div><div><br></div><div><i>Week 2 - Compiler Diagnostics Development (2)</i></div><div>Implement detection of clone code blocks (consist of sequence of same expressions) using method proposed above.</div><div>Implement the first phase -- partitioning similar statements into clusters.</div><div>Perform two-pass clustering algorithm to mark each statement with a cluster ID.</div><div><br></div><div><i>Week 3 - Compiler Diagnostics Development (3)</i></div><div>Implement the second phase -- Find pairs of identical cluster sequences.</div><div>Use a suffix tree approach to search pairs.</div><div><br></div><div><i>Week 4 - Compiler Diagnostics Development (4)</i></div><div>Implement the third phase -- Refine by examining identified code sequences for structural similarity.</div><div>Use anti-unification to check every pair of for overall similarity.</div><div>We select pairs that are exactly the same. </div><div><br></div><div><i>Week 5 - Test Suite Development (Type 1)</i></div><div>Develop test cases of Type 1</div><div>Fix bugs and polish code</div><div><br></div><div><i>Week 6 - Bug Checker Development (1)</i></div><div>Implement detection of Type 2, Type 3 as bug checker using the method proposed above.</div><div><br></div><div><i>Week 7 - Bug Checker Development (2)</i></div><div>Implement detection of Type 2, Type 3 as bug checker using the method proposed above.</div><div>Compare the selected clone pairs and make warning such as “Do you mean global2 here” </div><div><br></div><div><i>Week 8 - Test Suite Development (Type 2)</i></div><div>Develop test cases of Type 2</div><div>Fix bugs and polish code</div><div><br></div><div><i>Week 9 - Bug Checker Development (3)</i></div><div>Implement improvement of detection of Type 3</div><div><br></div><div><i>Week 10 - Test Suite Development (Type 3)</i></div><div>Develop test cases of Type 3</div><div>Fix bugs and polish code</div><div><br></div><div><i>Week 11 - Improvement</i></div><div>Make improvements.</div><div>Review all code.</div><div>Write documentations.</div><div><br></div><div><i>Week 12 - Documentations</i></div><div>Write documentations and prepare a final poster of the work.</div></div><div class="gmail_extra"><br><div class="gmail_quote">On Thu, Mar 26, 2015 at 10:13 PM, Vassil Vassilev <span dir="ltr"><<a href="mailto:vvasilev@cern.ch" target="_blank">vvasilev@cern.ch</a>></span> wrote:<br><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">My comments went in google-melange. Could you please post the newest version here once again when you finish it?<span class="HOEnZb"><font color="#888888"><br>
Vassil</font></span><div class="HOEnZb"><div class="h5"><br>
On 20/03/15 15:52, Jiayi Ye wrote:<br>
<blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">
Hi all,<br>
<br>
I am a student in Peking university, China. And I am pursing my master's<br>
degree. I have some experience with LLVM and static analysis such as wrote a<br>
LLVM pass to detect use-after-free bugs. I'd love to work in the project<br>
copy-paste detection. Please give me some advise about my initial proposal.<br>
Thank you so much.<br>
<br>
*Introduction*<br>
We begin with a basic introduction to clone detection. There are three types<br>
of clones.<br>
Type 1 (Exact clones): Identical code fragments except small variations in<br>
white spaces, layout and comments.<br>
Type 2 (Renamed clones): Syntactically identical code fragments except for<br>
variations in literals, identifiers, types, white spaces, layout and<br>
comments .<br>
Type 3 (Modified clones): These are copied fragments with additional<br>
modifications such as changed, added or removed statements, in addition to<br>
variations in literals, identifiers, types, white spaces, layout and<br>
comments.<br>
<br>
And techniques of code clone detection can be classified as follows.<br>
<br>
These techniques have pros and cons. Comparison between them is described<br>
in survey papers <a href="http://ijarcsms.com/docs/paper/volume3/issue1/V3I1-0012.pdf" target="_blank">http://ijarcsms.com/docs/<u></u>paper/volume3/issue1/V3I1-<u></u>0012.pdf</a><br>
and <a href="http://ijcsn.org/IJCSN-2014/3-2/Review-of-Code-Clone-Detection.pdf" target="_blank">http://ijcsn.org/IJCSN-2014/3-<u></u>2/Review-of-Code-Clone-<u></u>Detection.pdf</a>.<br>
<br>
*Implementation Plan*<br>
I plan to develop copy-paste detection in three steps.<br>
1. Detection of Type 1<br>
I’ll start off with Nick’s patch which have analyzed conditional operators<br>
and if/elseif chains. I’ll improve Nick’s patch to collect all the<br>
expressions through something like a && b &&c && a. This part could be<br>
implemented as compiler diagnostics because it is efficient and with very<br>
low false positive rate.<br>
<br>
2. Detection of Type 2<br>
I plan to adopt AST-based method described in<br>
<a href="http://www.informatik.uni-bremen.de/st/IWSC/bulychev.pdf" target="_blank">http://www.informatik.uni-<u></u>bremen.de/st/IWSC/bulychev.pdf</a> . The abstract<br>
syntax tree of program is first linearized, i.e., all sequences of<br>
statements and definitions are presented in the abstract tree as sibling<br>
subtrees. Then three phases of the algorithm are preformed. Three phases are<br>
as follows.<br>
<br>
(1) Partition similar statements into clusters using anti-unification. After<br>
the first phase each statement is marked with its cluster ID – thus, two<br>
statements with the same cluster ID are considered similar in this<br>
preliminary view. This phase uses a two-pass clustering algorithm. During<br>
the first pass all statements are gone over and each new statement is<br>
compared with the anti-unifiers of all existing clusters. During the second<br>
pass all the statements are traversed again and for each statement we search<br>
for the most similar pattern from the set produced in the previous pass<br>
(again using the anti-unification distance). And hashing (using the notion<br>
of d-cup) is used to speed up the clustering process.<br>
(2) Find identical sequences of cluster IDs. All pairs of sequences of<br>
statements which are identically labeled (have the same labels on the same<br>
position) are searched using a suffix tree approach.<br>
(3) Refine by examining identified code sequences for structural similarity.<br>
In this phase, every pair of candidate sequences is checked for overall<br>
similarity at the statement level, again using anti-unification.<br>
<br>
This part could implemented as bug checker.<br>
<br>
3. Detection of Type 3<br>
AST-based method can only detected Type 3 with low accuracy, because the<br>
added or deleted instructions strictly change the structure of AST. So we<br>
should introduce semantic analysis to achieve high accuracy. But algorithms<br>
based on semantic analysis have high computational complexity, which makes<br>
them unusable for large software systems analysis. Authors of the paper<br>
<a href="http://mpcs.sci.am/filesimages/volumes/volume_42/05.pdf" target="_blank">http://mpcs.sci.am/<u></u>filesimages/volumes/volume_42/<u></u>05.pdf</a> (Related talk<br>
<a href="http://llvm.org/devmtg/2015-02/slides/sargsyan-code-clone.pdf" target="_blank">http://llvm.org/devmtg/2015-<u></u>02/slides/sargsyan-code-clone.<u></u>pdf</a>) proposed a<br>
method which computes metrics for nodes of PDG and compares them. For every<br>
node of program dependence graph a characteristic vector is constructed,<br>
which contains information about neighbors. These characteristic vectors are<br>
represented as 64-bit integer numbers, which allows determining similarity<br>
between two nodes in amortized constant time. The paper also describes new<br>
approach for dependency graphs generation, which allows building them much<br>
faster than any of the existing methods.<br>
IMO we can imitate the method to implement a bug checker of type 3. But we<br>
don’t focus on IR level.<br>
<br>
In the mean time, I will develop different test cases of different clone<br>
types to test precision and recall of detection. Then a thorough test suite<br>
will be developed.<br>
Finally, I will commit detailed documentations and prepare a final poster of<br>
the work.<br>
<br>
*Timeline*<br>
TODO<br>
<br>
Best regards,<br>
Jiayi Ye<br>
<br>
<br>
<br>
--<br>
View this message in context: <a href="http://clang-developers.42468.n3.nabble.com/CopyPaste-detection-clang-static-analyzer-tp4037634p4044606.html" target="_blank">http://clang-developers.42468.<u></u>n3.nabble.com/CopyPaste-<u></u>detection-clang-static-<u></u>analyzer-tp4037634p4044606.<u></u>html</a><br>
Sent from the Clang Developers mailing list archive at Nabble.com.<br>
<br>
______________________________<u></u>_________________<br>
cfe-dev mailing list<br>
<a href="mailto:cfe-dev@cs.uiuc.edu" target="_blank">cfe-dev@cs.uiuc.edu</a><br>
<a href="http://lists.cs.uiuc.edu/mailman/listinfo/cfe-dev" target="_blank">http://lists.cs.uiuc.edu/<u></u>mailman/listinfo/cfe-dev</a><br>
</blockquote>
<br>
</div></div></blockquote></div><br></div>