[clang] clang: Relax LangOpts checks when lexing quoted numbers during preprocessing (PR #95798)

Tue Nov 12 11:58:46 PST 2024

================
@@ -2068,7 +2068,8 @@ bool Lexer::LexNumericConstant(Token &Result, const char *CurPtr) {
   }
 
   // If we have a digit separator, continue.
-  if (C == '\'' && (LangOpts.CPlusPlus14 || LangOpts.C23)) {
+  if (C == '\'' &&
+      (LangOpts.CPlusPlus14 || LangOpts.C23 || ParsingPreprocessorDirective)) {
----------------
Sirraide wrote:

I feel like another issue is that just accepting separators uncoditionally in preprocessor directives (i.e. even outside of dependency scanning) would technically be a language extension... doesn’t that mean that we’d technically have to add a compatibility warning? Because we’re usually pretty strict about doing that for things that are effectively a compiler extension.

Also, I’m not really that familiar w/ the dependency scanner, but I’m still confused as to why passing down the `LangOpts` isn’t an option: in #93753, it was pointed out that that would impair caching the results of the scan—but isn’t it already the case that e.g. what files something depends on is dependent on the language standard? E.g. what if I write:
```c++
#if __cplusplus >= WhateverTheValueForC++17IsAgain
#    include "foo.hh"
#else
#    include "bar.hh"
#endif
```

Does it just note that the file might depend on both `foo.hh` and `bar.hh`? Because if so, can’t we just set the language standard in the lang opts for the dependency scanner to the latest e.g. C++ standard? Because that would also fix this issue.

If we really can’t pass down the original lang opts for whatever reason, then I’m in favour of adding a language option either for `DependencyScanning` or for just literal separators.

> I don't think per-feature LangOptions is viable. There's digit separators, hex float support, binary literal support, different literal suffixes, various keywords, etc and that doesn't seem likely to scale well.

I mean, we don’t need a language option for *every* feature either, but adding them on an as-needed basis doesn’t seem completely untenable. We already have, like, over 300 of them, so I feel like we’re past the point where adding more of them would really make that big of a difference.

That said, I can’t seem to decide which of the following options would be the best (other than passing down the original lang opts):
1. Adding a language option for literal separators (or for an individual feature in general) solves the problem of a ‘dependency scanner’ flag potentially being confusing—and, more generally speaking, also has the nice side-effect that you would no longer have to remember the full list of language dialects which support any one feature—except that as Aaron pointed out, I don’t think a dependency scanner flag would be that confusing (‘language dialect X (which dependency scanning would be at that point) needs to do a weird thing’ isn’t really that unusual imo), and also, you still need to know what dialect(s) introduce certain features so you know when to emit compatibility diagnostics.
2. Adding a language option for dependency scanning means we only need a single one to account for this and however many other problems we might run into wrt language dialects causing a problem w/ dependency scanning, but it also means that now every place that checks whether a feature is supported (possibly only in the lexer/preprocessor; again, I don’t know the dependency scanner that well...) also needs to check if we’re performing dependency scanning.

Either way, I’d definitely prefer either option over just... allowing this in preprocessor directives irrespective of the language dialect.

https://github.com/llvm/llvm-project/pull/95798