[cfe-dev] CopyPaste detection clang static analyzer

Marshall Clow mclow.lists at gmail.com
Fri Feb 7 06:53:22 PST 2014


On Feb 7, 2014, at 4:49 AM, Vassil Vassilev <vvasilev at cern.ch> wrote:

> On 05/02/14 21:32, Nick Lewycky wrote:
>> On 3 February 2014 14:08, Richard <legalize at xmission.com> wrote:
>> 
>> In article <CAENS6EsgzhXWfANFze8VAp68qDGHnrHNZJaaLmi28YJtnQwOmw at mail.gmail.com>,
>>     David Blaikie <dblaikie at gmail.com> writes:
>> 
>> > On Mon, Feb 3, 2014 at 3:06 AM, Vassil Vassilev <vvasilev at cern.ch> wrote:
>> >
>> > >   A few months ago I was looking for a copy-paste detector for a C++
>> > > project. I didn't find such a feature of clang's static analyzer. Is this
>> > > the case?
>> >
>> > copy-paste detector? As in plagiarism detection?
>> 
>> I don't think plagiarism is the concern.  The concern is
>> copy/paste of blocks of code where the pasted block needs to be
>> updated in several places, but not all of the updates were performed.
>> 
>> I've implemented this sort of thing, but it's only 80% finished and has been kicking around on the low-priority end of my todo list for the past couple of years. Patch attached. It'd be great if someone were interested in finishing this off. I won't get to it soon.
>> 
>> Note that it's a warning instead of a static analysis check, which means that it must have an aggressively low number of false positives and that it must run quickly. The implementation I have analyzes conditional operators and if/else-if chains, but doesn't collect all the expressions through something like a && b && c && a. That would be the next thing to add.
>> 
>> It does have some really cool properties that we can only get because clang integrates closely with its preprocessor. Consider this sample from the testcase:
>> 
>> #define num_cpus() (1)
>> #define max_omp_threads() (1)
>> int test8(int expr) {
>>   if (expr) {
>>     return num_cpus();
>>   } else {
>>     return max_omp_threads();
>>   }
>> }
>> 
>> We know better than to warn on that, even though the AST looks the same. If you instead write "return num_cpus();" twice, we warn on that (that's test9 in the testsuite).
>> 
>> Nick
> Thanks, this looks very interesting. This may be a good start for a student. IIUC, non-unique exprs are the ones that have the same source ranges and the same FileIDs, right? Could this be upgraded to AST-node (structural) comparison?

I’d love to see a tool with this kind of functionality as part of llvm.
There is a commercial tool called “Pattern Insight” that does stuff like this. http://patterninsight.com/

I’ve been in groups that have used it in the past (careful locution; I haven’t personally used it), and occasionally it finds some amazing things.

Here’s another use case for you, the best example from our use:

Code block #1 is about 50 lines of code, with references to a global variable (global1, global1, global1, global1, global1).
Code block #2 is an obviously duplicated and edited block of code, with references to (global2, global2, global2, global1, global2).

Pattern Insight, while looking through this code base, emitted a message to the effect “Are you sure you don’t mean ‘global2’ here?”
(and was correct)
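
To make the idea concrete, here is a rough sketch of that kind of check. It is my own illustration, not how Pattern Insight works, and not part of Nick's patch. The function name findInconsistentRenames is hypothetical, and the sketch assumes the identifier references from both blocks have already been extracted and line up one-to-one.

```cpp
#include <cstddef>
#include <map>
#include <string>
#include <vector>

// Hypothetical helper: given the identifier references from an original
// block and from its pasted-and-edited copy (same length, same order),
// return the positions in the copy that disagree with the most common
// renaming of each original identifier.
std::vector<std::size_t> findInconsistentRenames(
    const std::vector<std::string> &original,
    const std::vector<std::string> &copy) {
  std::vector<std::size_t> suspicious;
  if (original.size() != copy.size())
    return suspicious; // Blocks are not aligned; this sketch gives up.

  // Count how often each original name maps to each name in the copy.
  std::map<std::string, std::map<std::string, std::size_t>> counts;
  for (std::size_t i = 0; i < original.size(); ++i)
    ++counts[original[i]][copy[i]];

  // Pick the dominant replacement for each original name.
  std::map<std::string, std::string> dominant;
  for (const auto &entry : counts) {
    std::size_t best = 0;
    for (const auto &candidate : entry.second) {
      if (candidate.second > best) {
        best = candidate.second;
        dominant[entry.first] = candidate.first;
      }
    }
  }

  // Flag positions that deviate from the dominant replacement.
  for (std::size_t i = 0; i < original.size(); ++i)
    if (copy[i] != dominant[original[i]])
      suspicious.push_back(i);
  return suspicious;
}
```

Fed the two reference lists from the example above, this flags the fourth reference in block #2 (index 3), the lone leftover global1. A real tool would first have to find and align the duplicated blocks, which is the hard part and is not attempted here.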

— Marshall
