[cfe-dev] GSoC proposal - Finding and analysing copy-pasted code with clang

Wed Apr 6 15:24:54 PDT 2016

Hi Anna & Mehdi,

having more contact with the LLVM/clang community (which is anyway the
reason I want to participate in the GSoC)
and getting early feedback by working in-tree sounds good to me!

So from my point of view I don't see any problem with following Anna's
plan and Mehdi's suggestions.

Cheers,

- Raphael

2016-04-06 2:43 GMT+02:00 Anna Zaks <ganna at apple.com>:
>
>> On Apr 5, 2016, at 5:01 PM, Raphael Isemann <teemperor at gmail.com> wrote:
>>
>> Hi Anna,
>>
>> thanks for the comments!
>>
>> I wrote my answers inline (see below):
>>
>> 2016-04-04 23:28 GMT+02:00 Anna Zaks <ganna at apple.com>:
>>> Hi Vassil and Raphael,
>>>
>>> Sorry for the delay, I just got to reading your proposal. Below are some
>>> comments.
>>>
>>> If I understand correctly, you are proposing to:
>>> 1) Add another stand-alone tool + a library that performs cross-translation
>>> unit clone detection on AST-level.
>>> 2) Add a checker to the Clang Static Analyzer that performs (the same?)
>>> clone detection but limited to a single translation unit.
>>>
>>
>> That's right. And it is the same clone detection mechanism.
>>
>>> How much code reuse will there be between the two? Will the stand-alone tool
>>> be built on top of the checker? I did not get that feeling from the
>>> proposal, especially, since the stand-alone tool will be completed first. It
>>> seems that all of the goals mentioned in the proposal except for the cross
>>> translation unit analysis could be done in the static analyzer. So why not
>>> start with that? I think it would be very beneficial for the project to have
>>> some clone detection committed in tree, immediately available to all of the
>>> existing users of the static analyzer!
>>
>> My current plan is to have some logic for finding clones in ASTs that
>> is used by the standalone tool and the checker. They are both just
>> wrappers around this clone detection code that provide input and report the
>> output in a appropriate way.
>>
>> The reason why I put the standalone tool as the first point in the
>> schedule is just that it's faster to develop the early versions of the
>> clone detection code inside the smaller codebase.
>> Rebuilding clang takes a few minutes, rebuilding the standalone tool
>> takes a few seconds.
>>
>
> LLVM is following incremental development policy (http://llvm.org/docs/DeveloperPolicy.html#incremental-development). Developing on a branch/ within a separate tool and landing all of the mature changes later on into LLVM would introduce other issues.
>
> I would much rather start with committing a weaker clone checker into the repo, changing the clang/static analyzer so that it can report the clones, and incrementally enhancing the check. I think that alone will be a lot of work! But on the upside, you will have a lot of users.
>
> If you have time left, you could explore building a stand alone tool that calls the checker and extends the clone detection to work cross-TUs. (That should be doable.) Further, we would want to integrate this tool into an issue navigator such as CodeChecker.
>
>> But I don't see any problem with finishing the checker first as soon
>> as the clone detection code is mature enough.
>>
>>> One of the obstacles in contributing the existing checker to the analyzer is
>>> issue reporting. Have you considered reporting the subsequent clones as a
>>> note on the first clone? The clones are related and it looks like the
>>> current output does not highlight that. (The static analyzer does not
>>> support notes right now, so you'd need to extend that functionality.)
>>
>> No, I haven't considered that so far but it sounds like something we should do!
>>
>>> I am very apprehensive about adding yet another analysis tool to the clang
>>> ecosystem. Having clang-tidy, the Clang Static Analyzer + yet another tool
>>> would be quite confusing to the user. The most user friendly approach is to
>>> have a single tool that highlights all the problems the users have in their
>>> code. I do acknowledge that it would not be possible to make the clone
>>> checker scale to cross translation unit analysis since we do not currently
>>> have the infrastructure to support that. However, building the stand-alone
>>> tool on top of the checker would allow turning it into a cross-translation
>>> unit checker once the infrastructure is added to the static analyzer.
>>
>> I see the problem for the user but I'm not sure if I understood the
>> last sentence of this paragraph correctly. Isn't the standalone tool
>> always obsolete as soon as there is cross-TU support in the static
>> analyzer?
>>
>
> Not necessarily. Additional work will be required to make this particular checker work cross translation units. If you design the tool so that it is built using the checker as a primitive, less work would need to be done to add cross translation unit support in the analyzer.
>
>> Also, are there already any plans on how the this cross-TU support
>> will look like? Because I'm interested how the problem with memory
>> consumption would be handled as keeping all TUs of a project in memory
>> is something I assume to be impossible for big projects.
>
> Keeping all TU in memory is not an option we are considering. Most likely we would be building summaries for functions and using those for analysis. There were various discussions on this topic on the list but it is not clear what would be the best approach.
>
>> Someone
>> actually dropped me an email last week confirming this.
>>
>> The standalone tool can do some trickery to get around this by
>> discarding TUs after the hashing the nodes and then reloading them if
>> necessary, but that obviously only works in the special use case of
>> this project.
>
> Once you have more specifics, let's discuss them.
>
>>
>>> Have you looked into CodeChecker and the new scan-build.py projects? They do
>>> rely on using the compilation database, which is something you plan on doing
>>> as well. Can you reuse scan-build.py instead of writing your own build
>>> interposition? The goal of CodeChecker is to collect and display static
>>> analysis reports generated by all clang-based tools, specifically, both
>>> clang-tidy and the Clang Static Analyzer are already supported. It would be
>>> valuable if the new stand-alone tools would be incorporated into the same
>>> workflow. This way the users could have a single point of entry when they
>>> look for bugs.
>>>
>>> CodeChecker incorporates a nice bug viewing UI. Integrating clone reporting
>>> into that UI would be great. However, you might need to extend/modify both
>>> the reporting and the UI to make it look great.
>>
>> I just had time for a quick peek at CodeChecker, so I can only say it
>> sounds like a good idea, but I can't say if I work on that without
>> knowing how much work it will be in the end.
>>
>> Lazlo already pointed out to scan-build.py to generate the compilation
>> database while writing proposal and I already use it for generating
>> testing databases, so using scan-build.py for interposition is
>> confirmed I guess.
>>
>>> What do you think?
>>> Anna.
>>
>> I hope I didn't miss to answer something, but if I did, please point it out!
>>
>> Cheers,
>>
>> - Raphael
>