[clang] clang: Relax LangOpts checks when lexing quoted numbers during preprocessing (PR #95798)

Thu Nov 21 12:59:36 PST 2024

================
@@ -2068,7 +2068,8 @@ bool Lexer::LexNumericConstant(Token &Result, const char *CurPtr) {
   }
 
   // If we have a digit separator, continue.
-  if (C == '\'' && (LangOpts.CPlusPlus14 || LangOpts.C23)) {
+  if (C == '\'' &&
+      (LangOpts.CPlusPlus14 || LangOpts.C23 || ParsingPreprocessorDirective)) {
----------------
jansvoboda11 wrote:

> I feel like another issue is that just accepting separators uncoditionally in preprocessor directives (i.e. even outside of dependency scanning) would technically be a language extension... doesn’t that mean that we’d technically have to add a compatibility warning? Because we’re usually pretty strict about doing that for things that are effectively a compiler extension.

Sure, but we don't have to expose that language extension to users, it can remain internal to the compiler.

> Also, I’m not really that familiar w/ the dependency scanner, but I’m still confused as to why passing down the `LangOpts` isn’t an option: in #93753, it was pointed out that that would impair caching the results of the scan—but isn’t it already the case that e.g. what files something depends on is dependent on the language standard? E.g. what if I write:
> 
> ```c++
> #if __cplusplus >= WhateverTheValueForC++17IsAgain
> #    include "foo.hh"
> #else
> #    include "bar.hh"
> #endif
> ```
> 
> Does it just note that the file might depend on both `foo.hh` and `bar.hh`? Because if so, can’t we just set the language standard in the lang opts for the dependency scanner to the latest e.g. C++ standard? Because that would also fix this issue.

The scanner works in two steps:

1. The virtual file system "minimizes" source files into sequence of tokens that are relevant for dependency discovery. Your code fragment would be transformed into something like `[if, ident, gte, ident, then, include, string, else, include, string, endif]`. 
2. The scanner instance takes those tokens and the TU `LanguageOptions` and runs full preprocessor on those tokens. This means that the condition should evaluate correctly.

Having (1) be context-free is important for performance.

On high level, the goal of #93753 is to respect `LanguageOptions` in the scanner. My argument is that we should be able to do so by representing tokens in the stream produced by (1) such that we're still able to emit context-aware, `LanguageOptions`-driven diagnostics in (2).

That solves the issue outlined by #93753 without impacting the performance.

> I mean, we don’t need a language option for _every_ feature either, but adding them on an as-needed basis doesn’t seem completely untenable.

+1

https://github.com/llvm/llvm-project/pull/95798