[PATCH] D54141: [clang-tidy] add deduplication support for run-clang-tidy.py

Thu Nov 8 04:02:37 PST 2018

JonasToth added a comment.

> Could you please explain your motivation of catching clang-tidy stdout? `--export-fixes` emits everything of `diagnostic` to YAML even the `diagnostic` doesn't have fixes. I guess the reason is that you want code snippets that you could show to users? If so, I think this is a separate UX problem, since we have everything in the emitted YAML, and we could construct whatever messages we want from it.

A bit for pragmatic reasons and a bit precaution.

- You are right with the code-snippet. I want to check for false-positives in new clang-tidy checks, if I can just scroll through and see the code-snippet in question it is just practical.
- diagnostics in template-code might emit multiple warnings at the same code-position. There is a realistic chance that `warning: xy happened here` will be the same for all template-instantiations, and only the `note: type 'XY' does not match` gives the differentiating hint. If dedup happens _ONLY_ on the first warning I fear we might loose valid diagnostics! I did re-evaluate and it seems that the emitted yaml does not include the notes. That is an issues, for example the CSA output relies on the emitted `notes` that explain the path to the bug.
- I originally implemented it for my buildbot which parses the check-name, location and so on and then gives an ordered output for each check in a module and so on. I extracted the essence for deduplication. `-export-fixes` still emits the clang-tidy diagnostics, so for my concrete use-case YAML based de-duplication brings no value in its current form, as my BB still struggles with the amount of stdout.

> 1. run clang-tidy in parallel on whole project, and emits a deduplicated result (`fixes.yaml`).
> 2. run a postprocessing in your buildbot that constructs diagnostic messages from `fixes.yaml`, and store it somewhere.
> 3. do whatever you want with output from 1) and 2).
> 
>   Step 1 could be done in upstream, probably via `AllTUsExecutor`, and deduplication can be done on the fly based on `<CheckName>::<FilePath>::<FileOffset>`; we still need `clang-apply-replacement` to deduplicate replacements; I'm happy to help with this. Step 2 could be done by your own, just a simple script.
> 
>> At the moment clang-apply-replacements is called at the end of an clang-tidy run in run-clang-tidy.py That means we produce ~GBs of Yaml first, to then emit ~10MBs worth of it.
> 
> That's why I suggest using some sort of other space-efficient formats to store the fixes. My intuition is that the final deduplicated result shouldn't be too large (even for YAML), because 1) no duplication 2) these are **actual diagnostics** in code, a healthy codebase shouldn't contain lots of problem 3) you have mentioned that you use it for small projects :)

To 3) I do use it for all kinds of projects, LLVM and Blender are currently the biggest ones. I want to go for LibreOffice, Chromium and so on as well. But right now the amount of noise is the biggest obstacle. My goal is not to check if the project is healthy/provide a service for the project, but check if _we_ have bugs in our checks and if code-transformation is correct, false positives, too much output, misleading messages, ...

To 2) LLVM is very chatty as well, I don't consider LLVM to be a bad code-base. Take `readability-braces-around-statements` for example. I want to test if the check transform all possible places correctly, LLVM does not follow this style and LLVM has a lot of big headers that implement functionality that are transitively included a lot. LLVM is the one that overflowed my 32GB of RAM :)

To 1) I do agree and the data presented support that. I suspect that Yaml-to-stdout Ratio is maybe 2/3:1? So in the analyzed case we end up with 10-15MB of data instead of ~600MB(all Yaml). Space optimization is something we can tackle after-wards as it does not seem to be pathological after deduplication.

In general: I somewhat consider this patch as rejected, I will still use it locally for my BB, but I think this revision should be closed. We could move the discussion here to the mailing-list if you want. It is ok to continue here as well, as we already started to make plans :)
My opinion is, that we should put as much of the deduplication into clang-tidy itself and not rely on tools like `run-clang-tidy.py` if we can.

So for me step 1. would be providing `AllTUsExecutor` in clang-tidy and make it parallel itself. For dedup we need hook the diagnostics. CSA has the `BugReport` class that could be hashed. clang-tidy currently doesn't have this, maybe a similar approach (or the same?) would help us out.

Repository:
  rCTE Clang Tools Extra

https://reviews.llvm.org/D54141