[clang] [Clang] Implement P3034R1 Module Declarations Shouldn’t be Macros (PR #90574)

Thu May 9 09:37:33 PDT 2024

================
@@ -1329,6 +1341,100 @@ bool Preprocessor::LexAfterModuleImport(Token &Result) {
   return true;
 }
 
+/// Lex a token following the 'module' contextual keyword.
+///
+/// [cpp.module]/p2:
+/// The pp-tokens, if any, of a pp-module shall be of the form:
+///       pp-module-name pp-module-partition[opt] pp-tokens[opt]
+///
+/// where the pp-tokens (if any) shall not begin with a ( preprocessing token
+/// and the grammar non-terminals are defined as:
+///       pp-module-name:
+///             pp-module-name-qualifierp[opt] identifier
+///       pp-module-partition:
+///             : pp-module-name-qualifier[opt] identifier
+///       pp-module-name-qualifier:
+///             identifier .
+///             pp-module-name-qualifier identifier .
+/// No identifier in the pp-module-name or pp-module-partition shall currently
+/// be defined as an object-like macro.
+///
+/// [cpp.module]/p3:
+/// Any preprocessing tokens after the module preprocessing token in the module
+/// directive are processed just as in normal text.
+bool Preprocessor::LexAfterModuleDecl(Token &Result) {
+  // Figure out what kind of lexer we actually have.
+  recomputeCurLexerKind();
+  LexUnexpandedToken(Result);
+
+  auto EnterTokens = [this](ArrayRef<Token> Toks, bool DisableMacroExpansion) {
+    auto ToksCopy = std::make_unique<Token[]>(Toks.size());
+    std::copy(Toks.begin(), Toks.end(), ToksCopy.get());
+    EnterTokenStream(std::move(ToksCopy), Toks.size(), DisableMacroExpansion,
+                     /*IsReinject=*/false);
+  };
+
+  // If we don't expect an identifier but got an identifier, it's not a part of
+  // module name.
+  if (!ModuleDeclExpectsIdentifier && Result.is(tok::identifier)) {
+    EnterTokens(Result, /*DisableMacroExpansion=*/false);
+    return false;
+  }
+
+  // The token sequence
+  //
+  // export[opt] module identifier (. identifier)*
+  //
+  // indicates a module directive. We already saw the 'module'
+  // contextual keyword, so now we're looking for the identifiers.
+  if (ModuleDeclExpectsIdentifier && Result.is(tok::identifier)) {
+    auto *MI = getMacroInfo(Result.getIdentifierInfo());
+    if (MI && MI->isObjectLike()) {
+      Diag(Result, diag::err_module_decl_cannot_be_macros)
+          << Result.getLocation() << ModuleDeclLexingPartitionName
+          << Result.getIdentifierInfo();
+    }
+    ModuleDeclExpectsIdentifier = false;
+    CurLexerCallback = CLK_LexAfterModuleDecl;
+    return true;
+  }
+
+  // If we're expecting a '.', a ':' or a ';', and we got a '.', then wait until
+  // we see the next identifier.
+  if (!ModuleDeclExpectsIdentifier && Result.isOneOf(tok::period, tok::colon)) {
+    ModuleDeclExpectsIdentifier = true;
+    ModuleDeclLexingPartitionName = Result.is(tok::colon);
+    CurLexerCallback = CLK_LexAfterModuleDecl;
+    return true;
+  }
+
+  // [cpp.module]/p2: where the pp-tokens (if any) shall not begin with a (
+  // preprocessing token [...]
+  if (!ModuleDeclExpectsIdentifier && Result.is(tok::l_paren)) {
+    ModuleDeclExpectsIdentifier = false;
+    Diag(Result, diag::err_unxepected_paren_in_module_decl)
+        << ModuleDeclLexingPartitionName;
+    Token Tok;
+    // We already have a '('.
+    unsigned NumParens = 1;
+    while (true) {
+      LexUnexpandedToken(Tok);
+      if (Tok.isOneOf(tok::eod, tok::eof, tok::semi, tok::period, tok::colon)) {
+        EnterTokens(Tok, /*DisableMacroExpansion=*/true);
+        break;
+      }
+      if (Tok.is(tok::l_paren))
+        NumParens++;
+      else if (Tok.is(tok::r_paren) && --NumParens == 0)
+        break;
+    }
+    CurLexerCallback = CLK_LexAfterModuleDecl;
+    return false;
+  }
+
+  return true;
+}
+
----------------
cor3ntin wrote:

What is the motivation to put the logic in the preprocessor ? It does adds quite a bit of complexity to the state machine.
But I struggle to make up my mind on what the better approach is.

I kept coming back to the idea of using an annotation

- Introduce a new `cxx_module_name` token annotation in `TokenKind.defs`
- Parse the full module name in the preprocessor
- Store it in  a `Preprocessor` or `Lexer` member (we need 2 `SmallVectorImpl<std::pair<IdentifierInfo *, SourceLocation>>`, one for the partition, one for the name)
- The value of that annotation would be a pointer to that pair of vector
- `ParseModuleName` would handle `cxx_module_name` by moving the content of the vector.
(It w

- That way `LexAfterModuleDecl` can just produce a single token with a fully parsed module name and we avoid having the logic spread in multiple places with complex state machines.


---

Alternatively, and perhaps much simpler,  lex the whole module name + partition at once here do all the checks and diagnostics and reinject all of these tokens all at once.

... actually, that the second solution should work nicely, right?


WDYT?


https://github.com/llvm/llvm-project/pull/90574