[llvm] [llvm] Add format check for MCSubtargetFeatures (PR #180943)

Thu Mar 5 07:40:42 PST 2026

DavidSpickett wrote:

> Example string "+attr1,+attr2,-attr3,...,+attrN" in PR description doesn't have trailing comma. I think that is pretty enough justification for recognition of string with trailing comma like invalid strings.

I get that llvm can crash. That is the "what has happened".

This PR is "what I'm doing about it".

The PR description is for "why I have chosen to handle the problem this way".

To me some of these things just seem like regular permissive parsing that I'd implement anywhere else. Handling a trailing comma isn't hard to do. If I tried to make LLDB exit the entire debug session when there was a trailing comma in a GDB remote protocol packet, I doubt it would pass review.

Would it be nice if no debug server ever sent us such packets? Sure. Is LLDB full of bugs where it probably does exit the session? Yeah, plenty :)

(I know that some of the invalid strings are incredibly invalid and would have to be rejected anyway, I'm just picking on the easiest example to illustrate my perspective)

So I hope you can see why this PR seems a bit odd from the outside. Probably the mindset is not the same, so I'm trying to learn what the mindset is here.

What I'd like to see in the PR description is a justification why you've gone with strict validation up front. It doesn't have to be extensive, I just don't want to be seeing PRs that just do things with no indication of why. Especially ones that could be causing downstream problems.

Perhaps it sounds like I'm asking for some giant RFC or whatever, but I'm not so the best way I can show you is to write out what I've assumed from the changes. Feel free to copy or modify any of this if it is what you were also thinking, or agree with in hindsight.

* Incorrect feature strings lead to at best, results inconsistent with caller expectations and at worse, llvm crashes caused by asserts. These are hard to debug and may happen after the string has been processed which makes it hard to track down the original problem. For example `<some string>` is parsed as `<some list of features>` but later causes an assertion failure in `<thing>`.
* Though some "invalid" strings could be handled in an intuitive way, for example trailing commas, there are others that simply cannot be handled. For example features that do not start with + or -. So I decided that being very strict was a better policy, as the format itself is quite simple and shouldn't require complex code to generate correctly.
* This string is so key to LLVM's correct operation, that forcing users to be very precise is, I think, warranted. They should think carefully about what they're asking for.
* Rejecting invalid strings up front moves the error point to the construction of MCSubtargetInfo. This can be more easily found with traceback or a debugger.
* Returning a nullptr is not great, but it is an improvement on crashing llvm. This could be improved but I do not intend to do that in this PR.

(There was similar sort of justification for the llvm runtimes build. The older compiler-rt build would auto detect what to build and it was always a pain to know exactly what you'd get. The runtimes build requires you to define exactly what you want, so you are forced to make choices up front. Which seems like a burden but saves you from much worse problems later.)

And if you're thinking "yeah, that was the point all along" - it is not clear from the PR as it is that this is the way of thinking.

https://github.com/llvm/llvm-project/pull/180943