[PATCH] D54141: [clang-tidy] add deduplication support for run-clang-tidy.py

Wed Nov 7 04:13:45 PST 2018

hokein added a comment.

>> If you're suggesting proceeding with this regex based solution, I
> 
> don't think that's a good idea. Why commit a hack which people will object to ever removing? Just see if we can do the right thing instead.

+1, my main concern is the complexity of the patch and maintenance burden of the python script.

In https://reviews.llvm.org/D54141#1288811, @JonasToth wrote:

> > - The output of clang-tidy diagnostic is YAML, and YAML is not an space-efficient format (just for human readability). If you want to save space further, we might consider using some compressed formats, e.g. llvm::bitcode. Given the reduced YAML result (5.4MB) is promising, this might not matter.
>
> The output were normal diagnostics written to stdout, deduplication happens from there (see the test-cases). The files i created were just through piping to filter some of the noise.
>  Without de-duplication its very hard to get something useful out of a run with many checks activated for bigger projects (e.g. Blender and OpenCV are useless to try, because they have some commonly used macros with a check-violation. The buildbot filled 30GB of RAM before it crashed and couldn't even finish the analysis of the project. Similar for LLVM).
>
> I would like to try the simple deduplication first and see if space is still an issue. After all I want to just read the diagnostic and see whats happening instantly and a more compressed format might not help there.

I misthought that the output was the `-export-fixes`, but what you mean is the stdout of clang-tidy.

Could you please explain your motivation of catching clang-tidy stdout? `--export-fixes` emits everything of `diagnostic` to YAML even the `diagnostic` doesn't have fixes. I guess the reason is that you want code snippets that you could show to users? If so, I think this is a separate UX problem, since we have everything in the emitted YAML, and we could construct whatever messages we want from it. A simpler approach maybe:

1. run clang-tidy in parallel on whole project, and emits a deduplicated result (`fixes.yaml`).
2. run a postprocessing in your buildbot that constructs diagnostic messages from `fixes.yaml`, and store it somewhere.
3. do whatever you want with output from 1) and 2).

Step 1 could be done in upstream, probably via `AllTUsExecutor`, and deduplication can be done on the fly based on `<CheckName>::<FilePath>::<FileOffset>`; we still need `clang-apply-replacement` to deduplicate replacements; I'm happy to help with this. Step 2 could be done by your own, just a simple script.

> At the moment clang-apply-replacements is called at the end of an clang-tidy run in run-clang-tidy.py That means we produce ~GBs of Yaml first, to then emit ~10MBs worth of it.

That's why I suggest using some sort of other space-efficient formats to store the fixes. My intuition is that the final deduplicated result shouldn't be too large (even for YAML), because 1) no duplication 2) these are **actual diagnostics** in code, a healthy codebase shouldn't contain lots of problem 3) you have mentioned that you use it for small projects :)

Repository:
  rCTE Clang Tools Extra

https://reviews.llvm.org/D54141