[clang] Reland "[Utils] add update-verify-tests.py" (#108630)" (PR #108658)

Tue Sep 17 12:17:37 PDT 2024

hnrklssn wrote:

> As a CFE reviewer, I very much hate the existence of this script (and the CodeGen equivalents, but that is a separate discussion).
> 
> This existing means I can no longer trust that a test is a reflection of what the author intended to write, as they might have just run this and "YOLO'ed" it away. So now I have to be 100x more careful reviewing tests, which means significantly worse review bandwidth. This is the same problem I have with AI generated code/tests, as this is basically only a 1/2 step above, "Hey Google, write me a lit test for this".
> 
> @cor3ntin and @AaronBallman and @Endilll .

That was always true, in the sense that there's nothing preventing the author from doing the exact same thing: looking at the output from the `clang -verify` failure and adding/removing expected diagnostics until it matches. In fact, even without this script that is the most mindless way of doing it, and I'm sure it already happens a lot.

In either way, authors should review their own changes first. I would argue that this script is significantly less harmful than any FileCheck generator script, because given an input program there is much less flexibility in how to structure the checks. With FileCheck you may not want to check every line, since some of them aren't directly relevant to the thing you're testing, instead making the test more fragile and harder to review through bloat. So creating FileCheck check lines is a bit of an art. But for `clang -verify` you have to add checkers for exactly the emitted diagnostics. No more or less. So the only real question is on which line to place them.

That said my main goal is not to make creating tests easier (although it of course also does that, and I don't know that I agree that it's harmful to lower the threshold to create tests), but to make it easier to batch update many/large breaking tests. We have many tests downstream that while testing features that don't yet exist upstream, are affected by new diagnostics added upstream. Handling many tests breaking all at once because of upstream changes is easier if it doesn't require a ton of `sed`-fu. Similar interactions happen when cherry-picking a patch from a dev branch to one or multiple release branches, that may not contain the exact same set of diagnostics. I could of course keep this script downstream, but we are not the only ones with downstream changes and I think others will find it helpful. And that's for a completely new script: keeping something like #108425 downstream only would be result in merge conflicts.

Furthermore, I have observed how existing regression tests create friction to improving phrasing of existing diagnostics. Old diagnostics often appear in many many tests, so while the code change to improving the diagnostic may be quick, you know it'll take significant time to update all of the regression tests using the old phrasing. So you don't do that drive-by improvement, because that wasn't your goal anyways. I think there are a ton of diagnostics in clang that could be improved to be clearer, but when the improvement is minor and friction is high we get stuck with the "good enough".

https://github.com/llvm/llvm-project/pull/108658