[llvm-dev] [GSoC 2017] Clang-based diff tool project

Thu Mar 23 12:52:56 PDT 2017

> On Mar 23, 2017, at 12:28 PM, Johannes Altmanninger <aclopte at gmail.com> wrote:
> 
> I am currently considering the gumtree algorithm which is described in
> [1]. It is able to detect code moves, updates.  There is also a java
> implementation of gumtree.  It supports C++ via srcML, however this
> seems to be quite incomplete and buggy. The algorithm consists of a top
> down and a bottom up traversal of the AST that result in a matching of
> corresponding nodes. It contains some heuristics similarity measurement.
> 
> Greg Clayton <clayborg at gmail.com <mailto:clayborg at gmail.com>> writes:
> 
>> My original idea was to write a semantic diff tool that just does some
>> simple things up front:
>> 
>> create an MD5 from all top level blocks of the code. Start by just
>> finding matching blocks of code ('{' and '}', '(' and ')') and
>> remember the source locations for these and their MD5 values. Run a
>> normal diff on the code and see what blocks the diffs fall into. Then
>> try to figure out where things moved by possibly delving deeper into
>> each block that matched something from the diff. Also if any blocks
>> moved to a completely different location, try and figure that out by
>> matching the MD5 of any blocks.
> 
> Ok, I think I see how this works in principle. This is possibly more
> efficient than gumtree, which has O(n^2) worst case complexity where n
> is the number of nodes in the AST. But I am not sure..
> 
>> 
>> For example if you had:
>> 
>> int main(int argc, const char **argv) {
>>  if (argc > 2) {
>>  }
>>  switch (argc) {
>>  }
>> } 
>> 
>> You would first make MD5s for the '(' and ')' in the "main" line and
>> for the '{' at the end of the main line, and ending at the end of the
>> code. Now the code looks like:
>> 
>> int main(int argc, const char **argv) {
>>  switch (argc) {
>>  }
>>  if (argc > 2) {
>>  }
>> } 
>> 
>> The diff would show that the "if" is gone and a new "if" is found
>> after the switch in the new version of the file. We would notice that
>> the diff appears inside the block from the first:
>> 
>> {
>>  if (argc > 2) {
>>  }
>>  switch (argc) {
>>  }
>> } 
>> 
>> And in the block from the second:
>> 
>> {
>>  switch (argc) {
>>  }
>>  if (argc > 2) {
>>  }
>> }
>> 
>> So we would then compute the MD5 for the blocks inside each of these
>> blocks and try to match things up. The MD5 would of course remove
>> spaces that aren't in strings and only compute the MD5 from the
>> characters that make sense. This simple type of approach could almost
>> work on any language without the need to be able to correctly compile
>> each file with all the right options.
> 
> Yeah, this is another advantage, one does not need to basically have
> a compiler in order to create the diff. So this would just consider
> blocks enclosed by parentheses or braces. It is nice that it works for
> any language that uses those for blocks.
> 
> So this would be a more lightweight and smart tool, while the gumtree is
> more powerful and extensible because it operates only on the AST. If the
> goal is to produce a diffing API that is able to exploit the semantics
> of the language and enables tools for visualization and merging then I
> think gumtree is the better choice. On the other hand, if we just want a
> Unix tool that is a smart and lean improvement to diff then your idea
> might be more appropriate in addition to being more flexible.
> 
> What do others think about this?

That seems like a good summary to me.

What I find cool about an AST based tool, is the ability to extract the diff as semantic informations, for example “function (or variable) was renamed from A to B”. 
I feel that you can get a totally different level of information with such a tool.

— 
Mehdi

> 
> 
> [1] https://hal.archives-ouvertes.fr/hal-01054552/document <https://hal.archives-ouvertes.fr/hal-01054552/document>
> 
>> 
>> Greg
>> 
>>> On Mar 20, 2017, at 4:47 PM, Mehdi Amini <mehdi.amini at apple.com> wrote:
>>> 
>>> (+CC: Greg Clayton who gave me this idea in the first place)
>>> 
>>>> On Mar 20, 2017, at 3:20 PM, Johannes Altmanninger via llvm-dev <llvm-dev at lists.llvm.org> wrote:
>>>> 
>>>> Hello,
>>>> 
>>>> I am currently studying Computer Science at TU Eindhoven. I am doing a
>>>> course that involves programming assignments on parts of LLVM such as
>>>> lowering, scheduling and optimization. For this year's Google Summer of
>>>> Code I plan to submit a proposal to implement a clang-based diff tool
>>>> [1].
>>> 
>>> Great! I look forward to see this :)
>>> 
>>>> 
>>>> I think it really pays off to have decent developer tools available, as
>>>> they can save tons of time. Clang tooling has obviously been very
>>>> successful.  I think it would be a good idea to develop a diff tool that
>>>> considers the structure of the code, as opposed to just the lines. Plain
>>>> old diff only thinks in terms of "additions" and "deletions", although
>>>> it would be more natural to also consider "updates" and "moves".
>>>> 
>>>> So a structural diff would work solely on the AST, hence formatting
>>>> changes are ignored. It would allow to highlight the exact location of a
>>>> change, and not a whole line. Furthermore, it would allow to compare
>>>> pieces of code with the same structure (think subclasses).
>>>> 
>>>> Besides some papers with clever AST-matching algorithms, a quick web
>>>> search yielded [2], which is a proof-of-concept implementation of a
>>>> structural comparison algorithm.  I think it demonstrates rather nicely
>>>> what could be done: movement of chunks of code can be easily traced.
>>>> 
>>>> Anyway, one could make all kinds of nice visualizations using a AST diff
>>>> tool, however, I think the initial focus should probably be on creating
>>>> one with a similar output to traditional diff, with the difference that
>>>> updates and moves are displayed in a easily readable way, which already
>>>> could improve developer productivity and happiness.
>>>> 
>>>> As of now I have one question: The output of the tool is meant just for
>>>> humans to read (and not for actual patching), right?
>>> 
>>> Yes. But we developed software as libraries usually. Practically I expect the main part of the work to write some piece of API that generate an “in-memory” representation of the diff.
>>> 
>>> A tool that is generating a textual-human readable output is likely the first client of this API and is likely critical to be able to functionally test it in the early development. In the future I hope it’d enable other graphical diff client to plug-in, or git-merge resolution tools as well.
>>> 
>>> Best,
>>> 
>>> — 
>>> Mehdi

-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.llvm.org/pipermail/llvm-dev/attachments/20170323/3372f45e/attachment.html>