[PATCH] D124500: [clang-tidy] Support expressions of literals in modernize-macro-to-enum

Aaron Ballman via Phabricator via cfe-commits cfe-commits at lists.llvm.org
Mon May 2 05:16:48 PDT 2022


aaron.ballman added a comment.

In D124500#3483328 <https://reviews.llvm.org/D124500#3483328>, @LegalizeAdulthood wrote:

> In D124500#3483224 <https://reviews.llvm.org/D124500#3483224>, @aaron.ballman wrote:
>
>> To clarify my previous comments, I'm fine punting on the edge cases until user reports come in, so don't let them block this review if you feel strongly about not supporting them. But when user reports start coming in, at some point I might start asking to replace the custom parser with calling into the clangParse library through some invented utility interface so that we don't have to deal with a long tail of bug reports.
>
> Yeah, I think punting on edge cases is the right thing to do here.  As I say,
> the worst that happens is your macro isn't converted automatically when you
> could convert it manually.

I largely agree, but I've found cases where we'll convert correct code to incorrect code, so it's a bit stronger than that.

> Maybe we're thinking about this check differently.
>
> I want this check to do the majority of the heavy lifting so that I'm only left with a
> few "weird" macros that I might have to convert by hand.  I never, ever, ever want
> this check to generate invalid code.  Therefore it is of paramount importance that
> it be conservative about what it recognizes as a candidate for conversion.

I think that's a reasonable goal, but we're not meeting the "never ever generate invalid code" part. I already know we can break correct C and C++ code through overflow. Should we ever allow an option to use an enum with a fixed underlying type in C++ (either `enum class` or `enum : type` form), we'll have the same breakage there but at different thresholds.
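
As a minimal illustration of those thresholds (the macro and enum names here are made up):

  // Fine as a macro in both C and C++:
  #define LARGE_VALUE (1LL << 40)

  // Converted naively: in C, an enumerator value must be representable as an
  // 'int', so this is accepted only as an extension (and is a hard error
  // under -pedantic-errors).
  enum { LARGE_VALUE = (1LL << 40) };

  // Converted with a fixed underlying type in C++: the threshold moves to
  // whatever the chosen type can represent.
  enum Widths : int { TooBig = (1LL << 40) };        // ill-formed, doesn't fit 'int'
  enum class Sizes : long long { Ok = (1LL << 40) }; // fine, fits 'long long'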

> I think this is the baseline for all the modernize checks, really.  There are still cases
> where loop-convert doesn't recognize an iterator based loop that could be
> converted to a range-based for loop.  The most important thing is that loop-convert
> doesn't take my weird iterator based loop and convert it to a range based for loop
> that doesn't compile.
>
> As for calling into `clangParse`, I think that would be overkill for a couple reasons.
> First, the real parser is going to do a lot of work creating AST nodes which we will
> never use, except to traverse the structure looking for things that would invalidate
> it as a candidate for a constant initializing expression.  Second, we only need to
> match the structure, we don't need to extract any information from the token stream
> other than a "thumbs up" or "thumbs down" that it is a valid initializing expression.

There are a few reasons I disagree with this. First, you need to know the value of the constant expression in order to know whether it's valid as an enumeration constant. That alone requires expression evaluation capabilities, assuming you want the check to behave correctly for those kinds of cases. But second, without support for generating that AST and validating it, you can never handle cases like this:

  constexpr int a = 12;
  constexpr int foo() { return 12; }
  
  #define FOO (a + 1)
  #define BAR (a + 2)
  #define BAZ (a + 3)
  #define QUUX (foo() + 4)

because you have no way to know whether or not that constant expression is valid based solely on token soup (especially when you start to factor in namespaces, qualified lookup, template instantiations, etc.). So I think avoiding the parser will limit the utility of this check. And maybe that's fine; maybe it's only ever intended to convert C code using defines to C code using enumerations, or other such simple cases.

> Many times in clang-tidy reviews performance concerns are raised and I think
> matching the token stream with the recursive descent matcher here is going to be
> much, much faster than invoking `clangParse`, particularly since the matcher bails
> out early on the first token that doesn't match.

Absolutely 100% agreed on this point.

> The only thing I can think of that
> would make it faster is if we could get the lexed tokens from the preprocessor
> instead of making it re-lex the macro body, but that's a change beyond the scope
> of this check or even clang-tidy.

Yeah, and that wouldn't even be sufficient because I still think we're going to want to know the *value* of the expression at some point.

Allllll that "this is the wrong way!" doom and gloom aside, I still think you're fine to proceed with the current approach if you'd like to. The situations under which it will break correct code should be documented explicitly in the check somewhere, but they feel sufficiently like edge cases to me that I'm fine moving forward, with the caveat that if this becomes much more problematic in the future, we have *a* path forward we could use to improve things should we decide we need to.



================
Comment at: clang-tools-extra/clang-tidy/modernize/IntegralLiteralExpressionMatcher.cpp:99
+
+  if (!Current->isLiteral() || isStringLiteral(Current->getKind()) ||
+      !isIntegralConstant(*Current)) {
----------------
LegalizeAdulthood wrote:
> aaron.ballman wrote:
> > LegalizeAdulthood wrote:
> > > aaron.ballman wrote:
> > > > I know this is code moved from elsewhere, but I suppose we never considered the odd edge case where a user does something like `"foo"[0]` as a really awful integer constant. :-D
> > > It's always possible to write crazy contorted code and have a check not recognize it.  I don't think it's worthwhile to spend time trying to handle torture cases unless we can find data that shows it's prevalent in real world code.
> > > 
> > > If I was doing a code review and saw this:
> > > ```
> > > enum {
> > >     FOO = "foo"[0]
> > > };
> > > ```
> > > I'd flag that in a code review as bogus, whereas if I saw:
> > > ```
> > > enum {
> > >     FOO = 'f'
> > > };
> > > ```
> > > That would be acceptable, which is why character literals are accepted as an integral literal initializer for an enum in this check.
> > >  I don't think it's worthwhile to spend time trying to handle torture cases unless we can find data that shows it's prevalent in real world code.
> > 
> > I think I'm okay agreeing to that in this particular case, but this is more to point out that writing your own parser is a maintenance burden. Users will hit cases we've both forgotten about here, they'll file a bug, then someone has to deal with it. It's very hard to justify to users "we think you write silly code" because they often have creative ways in which their code is not actually so silly, especially when we support "most" valid expressions.
> Writing your own parser is unavoidable here because we can't just assume that any old thing will be a valid initializer just by looking at the set of tokens present in the macro body.  (There is a separate discussion going on about improving the preprocessor support and parsing things more deeply, but that isn't even to the point of a prototype yet.)  The worst thing we can do is create "fixits" that produce invalid code.
> 
> The worst that happens if your expression isn't recognized is that your macro isn't matched as a candidate for an enum.  You can always make it an enum manually and join it with adjacent macros that were recognized and converted.
> 
> As it stands, the check only recognizes a single literal with an optional unary operator.
> 
> This change expands the check to recognize a broad range of expressions, allowing those macros to be converted to enums.  I opened the issue because running modernize-macro-to-enum on the [[ https://github.com/InsightSoftwareConsortium/ITK | ITK codebase ]] showed some simple expressions involving literals that weren't recognized and converted.
> 
> If an expression isn't recognized and an issue is opened, it will be an enhancement request to support a broader range of expressions, not a bug that this check created invalid code.
> 
> IMO, the more useful thing that's missing from the grammar is recognizing `sizeof` expressions rather than indexing string literals with an integral literal subscript.
> 
> I had planned on doing another increment to recognize `sizeof` expressions.
> Writing your own parser is unavoidable here because we can't just assume that any old thing will be a valid initializer just by looking at the set of tokens present in the macro body. 

If you ran the token sequence through clang's parser and got an AST node out, you'd have significantly *more* information as to whether something is a valid enum constant initializer, because you could check that it's an actual constant expression *and* that its value is within the valid range. This not only fixes edge-case bugs with your approach (like the fact that you can generate a series of literal expressions whose result is too large to store in an enumerator constant), but it also enables new functionality your approach currently disallows (like using constexpr variables instead of just numeric literals).

So I don't agree that it's unavoidable to write another custom parser.
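
To sketch what I mean (assuming we already had a parsed `Expr` for the macro body; the helper name and exact policy are made up, and a stricter version could insist on a true integer constant expression), the semantic check is small:
```
#include "clang/AST/ASTContext.h"
#include "clang/AST/Expr.h"
#include "llvm/ADT/APSInt.h"

// Hypothetical helper: returns true if E folds to an integral constant whose
// value is representable as an 'int' (the C requirement for an enumerator of
// an enum without a fixed underlying type).
static bool isValidEnumeratorInit(const clang::Expr *E,
                                  const clang::ASTContext &Ctx) {
  clang::Expr::EvalResult Result;
  if (!E->EvaluateAsInt(Result, Ctx))
    return false; // not an integral constant we can evaluate
  llvm::APSInt Value = Result.Val.getInt();
  unsigned IntWidth = Ctx.getIntWidth(Ctx.IntTy);
  // Signed values must fit in a signed int; unsigned values must not exceed
  // INT_MAX.
  return Value.isSigned() ? Value.isSignedIntN(IntWidth)
                          : Value.isIntN(IntWidth - 1);
}
```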


================
Comment at: clang-tools-extra/clang-tidy/modernize/IntegralLiteralExpressionMatcher.cpp:204
+  return true;
+}
+
----------------
LegalizeAdulthood wrote:
> LegalizeAdulthood wrote:
> > aaron.ballman wrote:
> > > LegalizeAdulthood wrote:
> > > > aaron.ballman wrote:
> > > > > Comma operator?
> > > > Remember that the use case here is identifying expressions that are initializers for an enum.  If you were doing a code review and you saw this:
> > > > ```
> > > > enum {
> > > >     FOO = (2, 3)
> > > > };
> > > > ```
> > > > Would you be OK with that?  I wouldn't.  Clang even warns about it: https://godbolt.org/z/Y641cb8Y9
> > > > 
> > > > Therefore I deliberately left comma operator out of the grammar.
> > > This is another case where I think you're predicting that users won't be using the full expressivity of the language and we'll get bug reports later. Again, in isolation, I tend to agree that I wouldn't be happy seeing that code. However, users write some very creative code and there's no technical reason why we can't or shouldn't handle comma operators.
> > "Don't let the perfect be the enemy of the good."
> > 
> > My inclination is to simply explicitly state that comma operator is not recognized in the documentation.  It's already implicit by its absence from the list of recognized operators.
> > 
> > Again, the worst that happens is that your macro isn't converted.
> > 
> > I'm open to being convinced that it's important, but you haven't convinced me yet `:)`
> It wasn't much extra work/code to add comma operator support so I've done that.
> "Don't let the perfect be the enemy of the good."

This is a production compiler toolchain. Correctness is important and that sometimes means caring more about perfection than you otherwise would like to.

> I'm open to being convinced that it's important, but you haven't convinced me yet :)

It's less about importance and more about maintainability coupled with correctness. With your approach, we get something that will have a long tail of bugs. If you used Clang's parser, you wouldn't get the same issue -- maintenance largely comes along for free, and the bugs are far less likely.

About the only reason I like your current approach over using clang's parsing is that it quite likely performs much better than doing an actual parse of the expression. But as you pointed out, about the worst thing a check can do is take correct code and make it incorrect -- avoiding that requires some amount of semantic evaluation of the expressions (which you're not doing). For example:
```
#define FINE 1LL << 30LL
#define BAD 1LL << 31LL
#define ALSO_BAD 1LL << 32L
```
We'll convert this into an enumeration and break `-pedantic-errors` builds in C. If we had a `ConstantExpr` object, we could see what its value is, note that it's greater than what fits into an `int`, and decide to do something smarter.
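
For concreteness, the converted form would look roughly like this; the last two enumerators are accepted only as an extension in C and become hard errors under `-pedantic-errors`:
```
enum {
  FINE = 1LL << 30LL,    // 2^30, representable as 'int'
  BAD = 1LL << 31LL,     // 2^31, exceeds INT_MAX on typical targets
  ALSO_BAD = 1LL << 32L, // 2^32, exceeds INT_MAX on typical targets
};
```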

So I continue to see the current approach as being somewhat reasonable (especially for experimentation), but incorrect in the long run. Not sufficiently incorrect for me to block this patch on, but incorrect enough that the first time this check becomes a maintenance burden, I'll be asking more strongly to do this the correct way.


CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D124500/new/

https://reviews.llvm.org/D124500


