[PATCH] D130626: [pseudo] experiment

Wed Jul 27 15:06:56 PDT 2022

hokein added a comment.

The solution in imperfect here (it is based on the identifier content, so it won't work if the code happen to have two same-content-but-different-kind tokens, e.g. `trace::Span Span;`), but it at least shows that ambiguities in identifiers are the most critical one.

And it gives some evidences that if we know all type information of identifiers, we could potentially get a perfect parse tree even without any soft disambiguation mechanism. I think It might affect our designs in some scenarios -- for example, if we have a fresh clang AST, we can first annotate all identifier tokens (what does this identifier refer to, a class, a enum etc), and use this approach to build a forest in the pseudoparser (we might not need a real disambiguation, because the forest likely ends up with a single perfect tree:)).

Had some offline discussion with @ilya-biryukov today:

If we look at all ambiguities in `ASTSignals.cpp`, 

  45 Ambiguous nodes:
     18 type-name
     14 simple-type-specifier
      5 postfix-expression
      3 namespace-name
      3 nested-name-specifier
      1 relational-expression
      1 template-argument

Most of ambiguities (>90%) are just "local", they won't affect the structure of the tree, and they seem to be less useful. If we think about the final output clang syntax-tree, we care about tree structures, and these ambiguities basically provide zero value. 
For example,  the nested-name-specifier `trace::span` case, in the clang syntax-tree we model the `trace` specifier as a general identifier name specifier regardless whether the `trace` is a type-name or namespace-name;
the simple-declaration `ASTSignals Signals;` case, it is sufficient to know it is a simple-declaration, and `ASTSignals` is a simple-type-specifier, but whether the simple-type-specifier is type-name (thus class-name, enum-name, typedef-name) or template-name is less interesting, and we probably don't want to distinguish them in the clang syntax-tree;
similar to the postfix-expression `Foo(...);` case, we might use the same node in the clang syntax-tree to model a function-call and an explicit class type conversion.

So one option will be to eliminate these "local" ambiguities in the forest (by replacing the type-name, class-name, enum-name, typedef-name, template-name with a generic `name`), as it won't affect tree-structure. Re the implementation, we can do a post-process on the forest -- replace an ambiguous forest node if all its alternatives share the same tree structure, ad-hoc targeting on type-name, simple-type-specifier, postfix-expression nonterminals is probably enough. An alternative is to adjust the cxx grammar rules (not sure how intrusive the change is);

The only "real" ambiguity is the `dyn_cast<NamespaceDecl>(ND->getDeclContext())` (whether it is a postfix expression, or a pair of comparison expressions). This is a real ambiguity in C++ that requires type information to resolve. For these ambiguities, we can't eliminate them, and we do need a ranking-based disambiguation.

Repository:
  rG LLVM Github Monorepo

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D130626/new/

https://reviews.llvm.org/D130626