[PATCH] D124500: [clang-tidy] Support expressions of literals in modernize-macro-to-enum

Richard via Phabricator via cfe-commits cfe-commits at lists.llvm.org
Mon May 2 13:11:21 PDT 2022


LegalizeAdulthood added inline comments.


================
Comment at: clang-tools-extra/clang-tidy/modernize/IntegralLiteralExpressionMatcher.cpp:99
+
+  if (!Current->isLiteral() || isStringLiteral(Current->getKind()) ||
+      !isIntegralConstant(*Current)) {
----------------
aaron.ballman wrote:
> LegalizeAdulthood wrote:
> > aaron.ballman wrote:
> > > LegalizeAdulthood wrote:
> > > > aaron.ballman wrote:
> > > > > LegalizeAdulthood wrote:
> > > > > > aaron.ballman wrote:
> > > > > > > I know this is code moved from elsewhere, but I suppose we never considered the odd edge case where a user does something like `"foo"[0]` as a really awful integer constant. :-D
> > > > > > It's always possible to write crazy contorted code and have a check not recognize it.  I don't think it's worthwhile to spend time trying to handle torture cases unless we can find data that shows it's prevalent in real world code.
> > > > > > 
> > > > > > If I was doing a code review and saw this:
> > > > > > ```
> > > > > > enum {
> > > > > >     FOO = "foo"[0]
> > > > > > };
> > > > > > ```
> > > > > > I'd flag that in a code review as bogus, whereas if I saw:
> > > > > > ```
> > > > > > enum {
> > > > > >     FOO = 'f'
> > > > > > };
> > > > > > ```
> > > > > > That would be acceptable, which is why character literals are accepted as an integral literal initializer for an enum in this check.
> > > > > >  I don't think it's worthwhile to spend time trying to handle torture cases unless we can find data that shows it's prevalent in real world code.
> > > > > 
> > > > > I think I'm okay agreeing to that in this particular case, but this is more to point out that writing your own parser is a maintenance burden. Users will hit cases we've both forgotten about here, they'll file a bug, then someone has to deal with it. It's very hard to justify to users "we think you write silly code" because they often have creative ways in which their code is not actually so silly, especially when we support "most" valid expressions.
> > > > Writing your own parser is unavoidable here because we can't just assume that any old thing will be a valid initializer just by looking at the set of tokens present in the macro body.  (There is a separate discussion going on about improving the preprocessor support and parsing things more deeply, but that isn't even to the point of a prototype yet.)  The worst thing we can do is create "fixits" that produce invalid code.
> > > > 
> > > > The worst that happens if your expression isn't recognized is that your macro isn't matched as a candidate for an enum.  You can always make it an enum manually and join it with adjacent macros that were recognized and converted.
> > > > 
> > > > As it stands, the check only recognizes a single literal with an optional unary operator.
> > > > 
> > > > This change expands the check to recognize a broad range of expressions, allowing those macros to be converted to enums.  I opened the issue because running modernize-macro-to-enum on the [[ https://github.com/InsightSoftwareConsortium/ITK | ITK codebase ]] showed some simple expressions involving literals that weren't recognized and converted.
> > > > 
> > > > If an expression isn't recognized and an issue is opened, it will be an enhancement request to support a broader range of expressions, not a bug that this check created invalid code.
> > > > 
> > > > IMO, the more useful thing that's missing from the grammar is recognizing `sizeof` expressions rather than indexing string literals with an integral literal subscript.
> > > > 
> > > > I had planned on doing another increment to recognize `sizeof` expressions.
> > > > Writing your own parser is unavoidable here because we can't just assume that any old thing will be a valid initializer just by looking at the set of tokens present in the macro body. 
> > > 
> > > If you ran the token sequence through clang's parser and got an AST node out, you'd have significantly *more* information as to whether something is a valid enum constant initializer because you can check that it's an actual constant expression *and* that it's within a valid range of values. This not only fixes edge case bugs with your approach (like the fact that you can generate a series of literal expressions that result in a value too large to store within an enumerator constant), but it enables new functionality your approach currently disallows (like using constexpr variables instead of just numeric literals).
> > > 
> > > So I don't agree that it's unavoidable to write another custom parser.
> > You keep bringing up the idea that the values have to be known, but so far they don't.
> > 
> > Remember, we are replacing macro identifiers with anonymous enum identifiers.  We aren't giving the enum a fixed underlying type, so as long as it's a valid integral literal expression, we're not changing any semantics.  Unscoped enums also allow arbitrary conversions to/from an underlying integral type chosen by the compiler.
> > 
> > C++20 9.7.1 paragraph 7 says:
> > 
> > > For an enumeration whose underlying type is not fixed, the underlying type is an integral type that can
> > > represent all the enumerator values defined in the enumeration. If no integral type can represent all the
> > > enumerator values, the enumeration is ill-formed. It is implementation-defined which integral type is used
> > > as the underlying type except that the underlying type shall not be larger than int unless the value of an
> > > enumerator cannot fit in an int or unsigned int. If the enumerator-list is empty, the underlying type is as
> > > if the enumeration had a single enumerator with value 0.
> > 
> > So the compiler is free to pick an underlying type that's large enough to handle all the explicitly listed initial values.  Do we actually need to know the values for this check?  I don't think so, because we aren't changing anything about the type of the named values.  When the compiler evaluates an integral literal, it goes through a similar algorithm assigning the appropriate type to those integral values:
> > 
> > C++20 5.9 paragraph 2:
> > 
> > > A preprocessing number does not have a type or a value; it acquires both after a successful conversion to an
> > > integer-literal token or a floating-point-literal token.
> > 
> > C++20 5.13.2 paragraph 3:
> > 
> > > The type of an integer-literal is the first type in the list in Table 8 corresponding to its optional integer-suffix
> > > in which its value can be represented. 
> > 
> > The table says the type is int, unsigned int, long int, unsigned long int, long long int, or unsigned long long int, chosen based on the suffix and the value; when no suffix is given, the first type big enough to hold the value is used.
> > 
> > > but [using `clangParse`] enables new functionality your approach currently disallows (like using constexpr variables instead of just numeric literals).
> > 
> > I agree that if we used the full parser, we'd bring in `constexpr` expressions as valid initializers for the enums.  However, before engaging in all that work, I'd like to see how common this is in existing codebases through feedback from users requesting the support.  Maybe engaging the parser isn't much work; I don't actually know.  I've never looked deeply at the actual parsing code in clang.  Maybe it's easy enough to throw a bag of tokens at it and get back an AST node, maybe not.  (I suspect not, based on my experience with the code base so far.)
> > 
> > My suspicion is that code bases that are heavy with macros for constants //aren't// using modern C++ in the body of those macros to define the values of those constants.  Certainly this is 100% true for C code that uses macros to define constants, by definition.  This check applies equally well to C code: C has had enums forever, but even recent C code still tends to use macros for constants.
> > 
> > Still, my suspicions aren't data.  I'd like to get this check deployed in a basic fashion and let user feedback provide data on what is important.
> > 
> > > So I don't agree that it's unavoidable to write another custom parser.
> > 
> > That's a fair point.  //Some// kind of parser is needed to recognize valid initializer expressions, or we run the risk of transforming valid code into invalid code.  What we're debating here is whether that should be a custom recognizer, as I've done, or `clangParse`.
> > You keep bringing up the idea that the values have to be known, but so far they don't.
> 
> See comments at the top level.
> 
> > So the compiler is free to pick an underlying type that's large enough to handle all the explicitly listed initial values. Do we actually need to know the values for this check? 
> 
> Yes, C requires the enumeration constants to be representable with `int`. But also, because this is in the `modernize` module, it's very likely we'll be getting a request to convert to using a scoped enumeration or an enumeration with the appropriate fixed underlying type in C++ as well.
Oh, I see now, thanks for explaining it.  I didn't realize that C restricts the values to `int`.


CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D124500/new/

https://reviews.llvm.org/D124500


