[cfe-dev] GSoC proposal - Finding and analysing copy-pasted code with clang

Tue Apr 5 17:01:23 PDT 2016

Hi Anna,

thanks for the comments!

I wrote my answers inline (see below):

2016-04-04 23:28 GMT+02:00 Anna Zaks <ganna at apple.com>:
> Hi Vassil and Raphael,
>
> Sorry for the delay, I just got to reading your proposal. Below are some
> comments.
>
> If I understand correctly, you are proposing to:
>  1) Add another stand-alone tool + a library that performs cross-translation
> unit clone detection on AST-level.
>  2) Add a checker to the Clang Static Analyzer that performs (the same?)
> clone detection but limited to a single translation unit.
>

That's right. And it is the same clone detection mechanism.

> How much code reuse will there be between the two? Will the stand-alone tool
> be built on top of the checker? I did not get that feeling from the
> proposal, especially, since the stand-alone tool will be completed first. It
> seems that all of the goals mentioned in the proposal except for the cross
> translation unit analysis could be done in the static analyzer. So why not
> start with that? I think it would be very beneficial for the project to have
> some clone detection committed in tree, immediately available to all of the
> existing users of the static analyzer!

My current plan is to have some logic for finding clones in ASTs that
is used by the standalone tool and the checker. They are both just
wrappers around this clone detection code that provide input and report the
output in an appropriate way.
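
To make that structure a bit more concrete, here is a rough sketch of
what I mean. It is Python only for brevity, every name in it is made up
for illustration, and the toy nested-tuple trees stand in for real
clang ASTs, so please don't read it as the actual design:

```python
# Hypothetical sketch: one shared clone-detection core, two thin wrappers.
# Toy ASTs are nested tuples: (kind, child, child, ...). Not the clang API.
from collections import defaultdict


def subtrees(node):
    """Yield the node itself and every subtree below it."""
    yield node
    for child in node[1:]:
        yield from subtrees(child)


def find_clones(roots, min_size=3):
    """Shared core: group identical subtrees with at least min_size nodes."""
    groups = defaultdict(list)
    for name, root in roots:
        for sub in subtrees(root):
            if sum(1 for _ in subtrees(sub)) >= min_size:
                groups[sub].append(name)
    return [(tree, names) for tree, names in groups.items() if len(names) > 1]


def checker_frontend(tu_name, root):
    """Checker-style wrapper: one translation unit in, warnings out."""
    return [f"{tu_name}: clone of {tree[0]} found {len(names)} times"
            for tree, names in find_clones([(tu_name, root)])]


def tool_frontend(tus):
    """Standalone-style wrapper: many translation units in, groups out."""
    return find_clones(list(tus.items()))
```

Both wrappers only decide what goes in and how results come out; all
the matching happens in find_clones.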

The reason why I put the standalone tool as the first point in the
schedule is just that it's faster to develop the early versions of the
clone detection code inside the smaller codebase.
Rebuilding clang takes a few minutes, rebuilding the standalone tool
takes a few seconds.

But I don't see any problem with finishing the checker first as soon
as the clone detection code is mature enough.

> One of the obstacles in contributing the existing checker to the analyzer is
> issue reporting. Have you considered reporting the subsequent clones as a
> note on the first clone? The clones are related and it looks like the
> current output does not highlight that. (The static analyzer does not
> support notes right now, so you'd need to extend that functionality.)

No, I haven't considered that so far but it sounds like something we should do!

> I am very apprehensive about adding yet another analysis tool to the clang
> ecosystem. Having clang-tidy, the Clang Static Analyzer + yet another tool
> would be quite confusing to the user. The most user friendly approach is to
> have a single tool that highlights all the problems the users have in their
> code. I do acknowledge that it would not be possible to make the clone
> checker scale to cross translation unit analysis since we do not currently
> have the infrastructure to support that. However, building the stand-alone
> tool on top of the checker would allow turning it into a cross-translation
> unit checker once the infrastructure is added to the static analyzer.

I see the problem for the user, but I'm not sure I understood the
last sentence of this paragraph correctly. Wouldn't the standalone
tool become obsolete as soon as there is cross-TU support in the
static analyzer?

Also, are there already any plans for what this cross-TU support
will look like? I'm interested in how the problem with memory
consumption would be handled, as keeping all TUs of a project in memory
is something I assume to be impossible for big projects. Someone
actually dropped me an email last week confirming this.

The standalone tool can do some trickery to get around this by
discarding TUs after hashing the nodes and then reloading them if
necessary, but that obviously only works in the special use case of
this project.
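
Roughly, the idea looks like the sketch below. Again this is only an
illustration in Python: load_tu is a hypothetical stand-in for parsing
a TU, the nested tuples stand in for clang ASTs, and the hashing
scheme is just an assumption to keep the example short:

```python
# Hypothetical sketch of "hash, discard, reload on demand". Toy ASTs are
# nested tuples (kind, child, ...); load_tu stands in for parsing a TU.
import hashlib
from collections import defaultdict


def node_hash(node):
    """Hash a subtree from its kind and the hashes of its children."""
    h = hashlib.sha1(node[0].encode())
    for child in node[1:]:
        h.update(node_hash(child))
    return h.digest()


def all_nodes(node, path=()):
    """Yield (path, node) pairs; a path is a tuple of child indices."""
    yield path, node
    for i, child in enumerate(node[1:]):
        yield from all_nodes(child, path + (i,))


def find_cross_tu_clones(tu_names, load_tu):
    index = defaultdict(list)        # hash -> [(TU name, path to node)]
    for name in tu_names:
        ast = load_tu(name)          # only one AST is held at a time
        for path, node in all_nodes(ast):
            if len(node) > 1:        # skip leaves, they match trivially
                index[node_hash(node)].append((name, path))
        del ast                      # drop the TU, keep only the hashes
    candidates = [locs for locs in index.values() if len(locs) > 1]
    # Only TUs that contain a candidate would need to be loaded again,
    # e.g. to verify a match or to print a report.
    to_reload = sorted({name for locs in candidates for name, _ in locs})
    return candidates, to_reload
```

Only the hash index stays in memory across TUs, which is why this works
for clone detection but not for analyses that need the full ASTs at
once.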

> Have you looked into CodeChecker and the new scan-build.py projects? They do
> rely on using the compilation database, which is something you plan on doing
> as well. Can you reuse scan-build.py instead of writing your own build
> interposition? The goal of CodeChecker is to collect and display static
> analysis reports generated by all clang-based tools, specifically, both
> clang-tidy and the Clang Static Analyzer are already supported. It would be
> valuable if the new stand-alone tools would be incorporated into the same
> workflow. This way the users could have a single point of entry when they
> look for bugs.
>
> CodeChecker incorporates a nice bug viewing UI. Integrating clone reporting
> into that UI would be great. However, you might need to extend/modify both
> the reporting and the UI to make it look great.

I only had time for a quick peek at CodeChecker, so I can only say it
sounds like a good idea. I can't commit to working on that without
knowing how much work it will be in the end.

Lazlo already pointed me to scan-build.py for generating the
compilation database while I was writing the proposal, and I already
use it for generating test databases, so using scan-build.py for
interposition is confirmed, I guess.

> What do you think?
> Anna.

I hope I didn't miss anything, but if I did, please point it out!

Cheers,

- Raphael