[llvm-dev] [GSoC 2017] Clang-based diff tool project
Mehdi Amini via llvm-dev
llvm-dev at lists.llvm.org
Thu Mar 23 12:52:56 PDT 2017
> On Mar 23, 2017, at 12:28 PM, Johannes Altmanninger <aclopte at gmail.com> wrote:
>
> I am currently considering the gumtree algorithm which is described in
> [1]. It is able to detect code moves, updates. There is also a java
> implementation of gumtree. It supports C++ via srcML, however this
> seems to be quite incomplete and buggy. The algorithm consists of a top
> down and a bottom up traversal of the AST that result in a matching of
> corresponding nodes. It contains some heuristics similarity measurement.
>
> Greg Clayton <clayborg at gmail.com <mailto:clayborg at gmail.com>> writes:
>
>> My original idea was to write a semantic diff tool that just does some
>> simple things up front:
>>
>> create an MD5 from all top level blocks of the code. Start by just
>> finding matching blocks of code ('{' and '}', '(' and ')') and
>> remember the source locations for these and their MD5 values. Run a
>> normal diff on the code and see what blocks the diffs fall into. Then
>> try to figure out where things moved by possibly delving deeper into
>> each block that matched something from the diff. Also if any blocks
>> moved to a completely different location, try and figure that out by
>> matching the MD5 of any blocks.
>
> Ok, I think I see how this works in principle. This is possibly more
> efficient than gumtree, which has O(n^2) worst case complexity where n
> is the number of nodes in the AST. But I am not sure..
>
>>
>> For example if you had:
>>
>> int main(int argc, const char **argv) {
>> if (argc > 2) {
>> }
>> switch (argc) {
>> }
>> }
>>
>> You would first make MD5s for the '(' and ')' in the "main" line and
>> for the '{' at the end of the main line, and ending at the end of the
>> code. Now the code looks like:
>>
>> int main(int argc, const char **argv) {
>> switch (argc) {
>> }
>> if (argc > 2) {
>> }
>> }
>>
>> The diff would show that the "if" is gone and a new "if" is found
>> after the switch in the new version of the file. We would notice that
>> the diff appears inside the block from the first:
>>
>> {
>> if (argc > 2) {
>> }
>> switch (argc) {
>> }
>> }
>>
>> And in the block from the second:
>>
>> {
>> switch (argc) {
>> }
>> if (argc > 2) {
>> }
>> }
>>
>> So we would then compute the MD5 for the blocks inside each of these
>> blocks and try to match things up. The MD5 would of course remove
>> spaces that aren't in strings and only compute the MD5 from the
>> characters that make sense. This simple type of approach could almost
>> work on any language without the need to be able to correctly compile
>> each file with all the right options.
>
> Yeah, this is another advantage, one does not need to basically have
> a compiler in order to create the diff. So this would just consider
> blocks enclosed by parentheses or braces. It is nice that it works for
> any language that uses those for blocks.
>
> So this would be a more lightweight and smart tool, while the gumtree is
> more powerful and extensible because it operates only on the AST. If the
> goal is to produce a diffing API that is able to exploit the semantics
> of the language and enables tools for visualization and merging then I
> think gumtree is the better choice. On the other hand, if we just want a
> Unix tool that is a smart and lean improvement to diff then your idea
> might be more appropriate in addition to being more flexible.
>
> What do others think about this?
That seems like a good summary to me.
What I find cool about an AST based tool, is the ability to extract the diff as semantic informations, for example “function (or variable) was renamed from A to B”.
I feel that you can get a totally different level of information with such a tool.
—
Mehdi
>
>
> [1] https://hal.archives-ouvertes.fr/hal-01054552/document <https://hal.archives-ouvertes.fr/hal-01054552/document>
>
>>
>> Greg
>>
>>> On Mar 20, 2017, at 4:47 PM, Mehdi Amini <mehdi.amini at apple.com> wrote:
>>>
>>> (+CC: Greg Clayton who gave me this idea in the first place)
>>>
>>>> On Mar 20, 2017, at 3:20 PM, Johannes Altmanninger via llvm-dev <llvm-dev at lists.llvm.org> wrote:
>>>>
>>>> Hello,
>>>>
>>>> I am currently studying Computer Science at TU Eindhoven. I am doing a
>>>> course that involves programming assignments on parts of LLVM such as
>>>> lowering, scheduling and optimization. For this year's Google Summer of
>>>> Code I plan to submit a proposal to implement a clang-based diff tool
>>>> [1].
>>>
>>> Great! I look forward to see this :)
>>>
>>>>
>>>> I think it really pays off to have decent developer tools available, as
>>>> they can save tons of time. Clang tooling has obviously been very
>>>> successful. I think it would be a good idea to develop a diff tool that
>>>> considers the structure of the code, as opposed to just the lines. Plain
>>>> old diff only thinks in terms of "additions" and "deletions", although
>>>> it would be more natural to also consider "updates" and "moves".
>>>>
>>>> So a structural diff would work solely on the AST, hence formatting
>>>> changes are ignored. It would allow to highlight the exact location of a
>>>> change, and not a whole line. Furthermore, it would allow to compare
>>>> pieces of code with the same structure (think subclasses).
>>>>
>>>> Besides some papers with clever AST-matching algorithms, a quick web
>>>> search yielded [2], which is a proof-of-concept implementation of a
>>>> structural comparison algorithm. I think it demonstrates rather nicely
>>>> what could be done: movement of chunks of code can be easily traced.
>>>>
>>>> Anyway, one could make all kinds of nice visualizations using a AST diff
>>>> tool, however, I think the initial focus should probably be on creating
>>>> one with a similar output to traditional diff, with the difference that
>>>> updates and moves are displayed in a easily readable way, which already
>>>> could improve developer productivity and happiness.
>>>>
>>>> As of now I have one question: The output of the tool is meant just for
>>>> humans to read (and not for actual patching), right?
>>>
>>> Yes. But we developed software as libraries usually. Practically I expect the main part of the work to write some piece of API that generate an “in-memory” representation of the diff.
>>>
>>> A tool that is generating a textual-human readable output is likely the first client of this API and is likely critical to be able to functionally test it in the early development. In the future I hope it’d enable other graphical diff client to plug-in, or git-merge resolution tools as well.
>>>
>>> Best,
>>>
>>> —
>>> Mehdi
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.llvm.org/pipermail/llvm-dev/attachments/20170323/3372f45e/attachment.html>
More information about the llvm-dev
mailing list