[cfe-dev] CopyPaste detection clang static analyzer

Vassil Vassilev vvasilev at cern.ch
Tue Feb 10 04:06:08 PST 2015


Hi all,
   I just wanted to bump this up (given that GSoC is starting). I didn't 
manage to get a good student for this project (proposal is below) last 
year :(. I thought it might work better if we went through the LLVM 
mentoring organization. Do you think this would make a good GSoC 
project from Clang's perspective? I'd be happy to update the proposal to 
make it more attractive or general-purpose.
Vassil


      Code copy/paste detection

*Description*: Copy/paste is a common programming practice. Most 
programmers start from a code snippet that already exists in the system 
and modify it to match their needs. Some code snippets easily end up 
being copied dozens of times, which hurts maintainability, 
understandability and logical design. Clang <http://clang.llvm.org> and 
Clang's static analyzer <http://clang-analyzer.llvm.org/> provide all 
the building blocks needed to build a generic C/C++ copy/paste detector.
*Expected results*: Build a standalone tool or Clang plugin that can 
detect copy/pasted code. Lay the foundations for detecting slightly 
modified code (semantic analysis required). Implement tests for all the 
implemented functionality. Prepare a final poster of the work and be 
ready to present it.
*Required knowledge*: Advanced C++, Basic knowledge of Clang/Clang 
Static Analyzer.

*Mentor*: Vassil Vassilev/ maybe somebody else as second mentor?
<mailto:sft-gsoc-AT-cern-dot-ch?subject=GSoC%202014%20Extending%20Cling>


On 07/02/14 22:20, Nick Lewycky wrote:
> On 7 February 2014 04:49, Vassil Vassilev <vvasilev at cern.ch> wrote:
>
>     On 05/02/14 21:32, Nick Lewycky wrote:
>>     On 3 February 2014 14:08, Richard <legalize at xmission.com> wrote:
>>
>>
>>         In article
>>         <CAENS6EsgzhXWfANFze8VAp68qDGHnrHNZJaaLmi28YJtnQwOmw at mail.gmail.com>,
>>             David Blaikie <dblaikie at gmail.com> writes:
>>
>>         > On Mon, Feb 3, 2014 at 3:06 AM, Vassil Vassilev <vvasilev at cern.ch> wrote:
>>         >
>>         > >   A few months ago I was looking for a copy-paste detector for a C++
>>         > > project. I didn't find such a feature of clang's static analyzer. Is this
>>         > > the case?
>>         >
>>         > copy-paste detector? As in plagiarism detection?
>>
>>         I don't think plagiarism is the concern.  The concern is
>>         copy/paste of blocks of code where the pasted block needs to be
>>         updated in several places, but not all of the updates were
>>         performed.
>>
>>
>>     I've implemented this sort of thing, but it's only 80% finished
>>     and has been kicking around on the low-priority end of my todo
>>     list for the past couple of years. Patch attached. It'd be great
>>     if someone were interested in finishing this off. I won't get to
>>     it soon.
>>
>>     Note that it's a warning instead of a static analysis check, which
>>     means that it must have an aggressively low number of false
>>     positives, and that it must run quickly. The implementation I
>>     have analyzes conditional operators and if/else-if chains, but
>>     doesn't collect all the expressions through something like
>>     a && b && c && a. That would be the next thing to add.
>>
>>     It does have some really cool properties that we can only get
>>     because clang integrates closely with its preprocessor. Consider
>>     this sample from the testcase:
>>
>>     #define num_cpus() (1)
>>     #define max_omp_threads() (1)
>>     int test8(int expr) {
>>       if (expr) {
>>         return num_cpus();
>>       } else {
>>         return max_omp_threads();
>>       }
>>     }
>>
>>     We know better than to warn on that, even though the AST looks
>>     the same. If you instead write "return num_cpus();" twice, we
>>     warn on that (that's test9 in the testsuite).
>>
>>     Nick
>     Thanks, this looks very interesting. This may be a good start for a
>     student. IIUC, non-unique exprs are the ones that have the same
>     source ranges and the same FileIDs, right? Could this be upgraded to
>     AST-node (structural) comparison?
>
>
> It is an AST-node comparison. In order to handle the case of different 
> macros, we ask the AST nodes what their SourceLocation was, and factor 
> in the macro ID, if there was one. A large part of the patch is a 
> change to the Stmt::Profile logic to look at all the SourceLocations 
> in all the possible AST nodes.
>
>
>     Vassil
>
>>
>>         Coverity can detect such instances, for instance.
>>
>>         Here is an article from 2006 describing such a tool:
>>         <http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.123.113>
>>
>>         Wikipedia says PMD has a copy/paste detector that works with C++:
>>         <http://en.wikipedia.org/wiki/PMD_(software)#Copy.2FPaste_Detector_.28CPD.29>
>>
>>         "Note that CPD works with Java, JSP, C, C++, C#, Fortran and
>>         PHP code.
>>         Your own language is missing ? See how to add it here"
>>         <http://pmd.sourceforge.net/snapshot/cpd-usage.html>
>>         --
>>         "The Direct3D Graphics Pipeline" free book
>>         <http://tinyurl.com/d3d-pipeline>
>>              The Computer Graphics Museum
>>         <http://ComputerGraphicsMuseum.org>
>>                  The Terminals Wiki <http://terminals.classiccmp.org>
>>           Legalize Adulthood! (my blog)
>>         <http://LegalizeAdulthood.wordpress.com>
>>         _______________________________________________
>>         cfe-dev mailing list
>>         cfe-dev at cs.uiuc.edu <mailto:cfe-dev at cs.uiuc.edu>
>>         http://lists.cs.uiuc.edu/mailman/listinfo/cfe-dev
>>
>>
>>
>>
>
>


-- 
--------------------------------------------
Q: Why is this email five sentences or less?
A: http://five.sentenc.es
