[clang] Reland "[Utils] add update-verify-tests.py (#108630)" (PR #108658)
Vlad Serebrennikov via cfe-commits
cfe-commits at lists.llvm.org
Tue Sep 17 16:56:41 PDT 2024
Endilll wrote:
> That was always true, in the sense that there's nothing preventing the author from doing the exact same thing: looking at the output from the clang -verify failure and adding/removing expected diagnostics until it matches. In fact, even without this script that is the most mindless way of doing it, and I'm sure it already happens a lot.
An author still has to pay some attention to the diagnostic tests they are changing, and they can still notice things. Sure, they might not pick up on subtleties, but I think there's a fair chance of catching something really wrong. Again, we're constrained on review bandwidth, not on making changes. As a testament to that, the number of active PRs has been going up constantly ever since we switched to them, and we also have a long tail of open reviews in Phabricator. So this little nudge to review your own work helps with the worst bottleneck we've been having.
> But for clang -verify you have to add checkers for exactly the emitted diagnostics. No more or less. So the only real question is on which line to place them.
I don't think it's that simple, at least for some tests. We have tests that run in `-verify` mode multiple times with different combinations of compiler options. Such tests might rely on custom prefixes (`-verify=prefix1,prefix2`), and updating them automatically can easily yield incorrect results. As an example of how more complicated tests can look, consider the C++ DR test suite. Here's one example:
https://github.com/llvm/llvm-project/blob/ca0613e0fcef2a9972f2802318201f8272e74693/clang/test/CXX/drs/cwg3xx.cpp#L779-L782
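To illustrate the kind of interaction I mean, here's a minimal sketch of such a test (a hypothetical example, not one from the tree; the exact diagnostic wording is illustrative). The same source line carries different directives under different prefixes:

```cpp
// Both RUN lines share the "expected" prefix (for diagnostics common to both
// modes), and each enables a mode-specific prefix on top of it.
// RUN: %clang_cc1 -std=c++14 -fsyntax-only -verify=expected,cxx14 %s
// RUN: %clang_cc1 -std=c++17 -fsyntax-only -verify=expected,cxx17 %s

void f() {
  // One source line, two different expectations depending on the language mode:
  register int r = 0; // cxx14-warning {{'register' storage class specifier is deprecated}}
                      // cxx17-error@-1 {{ISO C++17 does not allow 'register' storage class specifier}}
  (void)r;
}
```

An automated updater that regenerates directives from a single RUN invocation only sees one of these diagnostics, so a naive rewrite can clobber or mislabel the directive belonging to the other mode.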
> That said my main goal is not to make creating tests easier (although it of course also does that, and I don't know that I agree that it's harmful to lower the threshold to create tests), but to make it easier to batch update many/large breaking tests. We have many tests downstream that while testing features that don't yet exist upstream, are affected by new diagnostics added upstream. Handling many tests breaking all at once because of upstream changes is easier if it doesn't require a ton of sed-fu. Similar interactions happen when cherry-picking a patch from a dev branch to one or multiple release branches, that may not contain the exact same set of diagnostics. I could of course keep this script downstream, but we are not the only ones with downstream changes and I think others will find it helpful. And that's for a completely new script: keeping something like https://github.com/llvm/llvm-project/pull/108425 downstream only would result in merge conflicts.
I don't intend to dismiss the pain this could cause downstream, but in my opinion it doesn't justify opening the floodgates to automated test updates upstream. At the moment I don't have a strong opinion on infrastructure that supports such automated updates, like #108425, which you mentioned.
> Furthermore, I have observed how existing regression tests create friction to improving the phrasing of existing diagnostics. Old diagnostics often appear in many, many tests, so while the code change to improve the diagnostic may be quick, you know it'll take significant time to update all of the regression tests using the old phrasing.
I agree that the pain is there. But again, I don't think the end justifies the means.
> So you don't do that drive-by improvement, because that wasn't your goal anyways.
A drive-by fix that makes you update a lot of tests would introduce a lot of noise into your PR anyway, making it noticeably harder to review. I believe the script addresses the wrong problem here.
https://github.com/llvm/llvm-project/pull/108658