[PATCH] D130337: [pseudo] Eliminate multiple-specified-types ambiguities using guards

Fri Jul 22 05:45:09 PDT 2022

hokein added a comment.

Thanks, the change looks good in general.

> we handle *all* rules for the interesting node types explicitly, rather than default: return false. This allows us to assert that all cases are handled, so things don't "fall through the cracks" after grammar changes. Alternative suggestions welcome. (I have a feeling this "exhaustive pattern-matching" idea will come up again...)

The dangerous bit is that now we will crash at the runtime if the parsing code triggers a missing case.

Yeah, I hit this problem in the previous function-declarator patch. One approach will be to define an enum for each nonterminal `enum Nonterminal { rhs0_rhs1 }`, rather than put all rules into a single enum. It is easier to enumerate all values for a single nonterminal (I think this is the common case)

  namespace cxx {
  namespace rule {

  enum simple_type_specifier {
    builtin_type0,
    nested_name_specifier0_type_name1,
    ...
  }

  }
  }

In D130337#3671159 <https://reviews.llvm.org/D130337#3671159>, @sammccall wrote:

> FWIW, as-is with no caching, this is a ~2% slowdown on my machine (5.82 -> 5.72 MB/s on SemaCodeComplete.cpp).

Actually, the number looks quite good to me (I expected that there will be a significant slowdown without caching). I think we can check in the current version, and add the caching stuff afterwards.

> Whereas D130150 <https://reviews.llvm.org/D130150> using the grammar is a a 7% speedup (5.82 -> 6.22), so roughly an 9% performance difference between the approaches.

Yeah, because the grammar-based approach reduces the number of ambiguous node in the forest.

================
Comment at: clang-tools-extra/pseudo/lib/cxx/CXX.cpp:284
+           defining_type_specifier_seq_0defining_type_specifier_1defining_type_specifier_seq,
+       GUARD(!hasExclusiveType(P.RHS.front()) ||
+             !hasExclusiveType(P.RHS.back()))},
----------------
nit: I would suggest using the index explicitly `P.RHS[0]`, `P.RHS[1]`, it increases the readability (the rul name encoding the index, easier to spot the corresponding element).

================
Comment at: clang-tools-extra/pseudo/lib/cxx/cxx.bnf:353
 decl-specifier-seq := decl-specifier
-decl-specifier-seq := decl-specifier decl-specifier-seq
+decl-specifier-seq := decl-specifier decl-specifier-seq [guard]
 storage-class-specifier := STATIC
----------------
offtopic comment: The sad bit of the `RuleID` approach (vs `guard=SingleType`) is that we don't really know what kind of the guard is by reading the grammar file only.

I think this is critical information, and worth to keep in the grammar file. (ideas: add comments, or bring the `guard=SingleType` in the grammar again, but we ignore the `guard` value in the grammar parsing).

================
Comment at: clang-tools-extra/pseudo/lib/cxx/cxx.bnf:385
+builtin-type := BOOL
 simple-type-specifier := SHORT
+builtin-type := INT
----------------
I think the reason to leave SHORT/LONG/SIGNED UNSIGNED as-is is that they can combined with other type (e.g. short int).

Can we group them together, and add a comment?

Repository:
  rG LLVM Github Monorepo

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D130337/new/

https://reviews.llvm.org/D130337