[PATCH] D54141: [clang-tidy] add deduplication support for run-clang-tidy.py

Tue Nov 6 08:24:18 PST 2018

JonasToth added a comment.

> - The output of clang-tidy diagnostics is YAML, and YAML is not a space-efficient format (it is meant for human readability). If you want to save more space, we might consider a compressed format, e.g. llvm::bitcode. Given that the reduced YAML result (5.4MB) is promising, this might not matter.

The output was normal diagnostics written to stdout; deduplication happens from there (see the test cases). The files I created came just from piping that output to filter some of the noise.
Without deduplication it's very hard to get anything useful out of a run with many checks activated on bigger projects. Blender and OpenCV, for example, are useless to try, because they have some commonly used macros that contain a check violation. The buildbot filled 30GB of RAM before it crashed, and it couldn't even finish the analysis of the project. The same holds for LLVM.
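
To illustrate the idea (this is a minimal sketch, not the parser from the patch): assuming each diagnostic starts with a line of the usual `path:line:col: warning: message [check-name]` form and that any following lines (source snippet, notes) belong to it, duplicates can be dropped by remembering a key per diagnostic. The names `DIAG_RE` and `deduplicate` are hypothetical and only serve this example.

```python
import re

# Assumed shape of the first line of a diagnostic, e.g.
#   /src/a.h:10:5: warning: use nullptr [modernize-use-nullptr]
DIAG_RE = re.compile(
    r"^(?P<path>.+?):(?P<line>\d+):(?P<col>\d+): "
    r"(?P<severity>warning|error): (?P<message>.*) \[(?P<check>[^\]]+)\]$"
)


def deduplicate(lines):
    """Return the unique diagnostic blocks found in `lines`.

    A block is the diagnostic line plus every following line up to the
    next diagnostic. Lines before the first diagnostic are dropped.
    """
    seen = set()
    unique = []
    current = None  # block being collected; None while skipping a duplicate
    for raw in lines:
        line = raw.rstrip("\n")
        match = DIAG_RE.match(line)
        if match:
            key = (
                match.group("path"),
                match.group("line"),
                match.group("col"),
                match.group("check"),
                match.group("message"),
            )
            if key in seen:
                current = None  # skip this block and its trailing lines
            else:
                seen.add(key)
                current = [line]
                unique.append(current)
        elif current is not None:
            current.append(line)
    return unique
```

Keeping only the keys in `seen` is what bounds memory: a macro in a widely included header that triggers the same warning in every translation unit is stored once instead of once per inclusion.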

> - clang-tidy itself doesn't do deduplication, and `run-clang-tidy.py` seems to be an old way of running clang-tidy in parallel. The python script seems to have become more complicated now. We have `AllTUsToolExecutor` right now, which supports running clang tools on a compilation database in parallel, so another option would be to use `AllTUsToolExecutor` in clang-tidy and do the deduplication inside the clang-tidy binary (in the reduce phase), which should be faster than the python script (which spawns new clang-tidy processes and round-trips all the data through YAML-on-disk).

Yes, this patch came out of necessity: running all available clang-tidy checks on big projects and seeing whether their transformations are correct was, and still is, simply impossible with the tools we have upstream.
I agree that `AllTUsToolExecutor` would be better than the python script, but I think getting that done takes longer than just patching the script now. With the patch here (it is an off-by-default option as well) it is easier to test all pieces of clang-tidy. From there we can easily migrate to something better than `run-clang-tidy.py`.
The deduplication within clang-tidy would be the best option! But for full deduplication the parallelization must happen first.
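
As a rough Python analogy of that map/reduce shape (it does not use or mirror the `AllTUsToolExecutor` API): each translation unit is analyzed in parallel, and only the merge step sees all results, which is why full deduplication depends on the parallel driver. `analyze_all` and `analyze_one` are hypothetical names; `analyze_one` stands in for whatever produces the diagnostics of one translation unit.

```python
from concurrent.futures import ThreadPoolExecutor


def analyze_all(files, analyze_one, workers=4):
    """Run `analyze_one` on every file in parallel and merge the results.

    `analyze_one` is a hypothetical callable that returns an iterable of
    hashable diagnostic keys for a single translation unit.
    """
    merged = set()
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # Map phase: one task per translation unit.
        for diagnostics in pool.map(analyze_one, files):
            # Reduce phase: a diagnostic reported by several translation
            # units (e.g. from a shared header) is stored only once.
            merged.update(diagnostics)
    return merged
```

With the merge done in-process, nothing has to be written to disk and parsed again between the two phases.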

> The python script seems to have become more complicated now.

A bit, yes. The actual calling of clang-tidy and the other parts are not touched. Only the parser adds complexity, and it is covered by the unit tests. I don't think this solution will live forever, but it's fast and effective, and again it's optional and off by default.

For context: This is more of a spinoff of my attempts to get statistics on clang-tidy results for a wide range of projects. This parser is the minimal version that can do deduplication.

Repository:
  rCTE Clang Tools Extra

https://reviews.llvm.org/D54141