[PATCH] D126536: [pseudo] Add grammar annotations support.

Tue May 31 11:32:21 PDT 2022

sammccall added a comment.

Nice!

This has tests for the parsing-the-attribute bits, but I think we're missing tests for the actual guards added.

================
Comment at: clang-tools-extra/pseudo/include/clang-pseudo/GLR.h:114
+using ReduceGuard =
+    std::function<bool(RuleID RID, llvm::ArrayRef<const ForestNode *> RHS,
+                       const TokenStream &Tokens, const Grammar &G)>;
----------------
this signature seems a little off to me.

Guard logic is identified by a guard ID and we look it up, but we're not passing in the guard ID for some reason. Instead we pass in the rule ID, which this function uses to look up the guard ID again. Why not just pass the guard ID?

That said, it's a bit surprising that we have the rules declare separate guard rules, but then they're all implemented by one function. A map of functions seems more natural. (but not a performance advantage I guess)

Naming: it's confusing that this umbrella function is called "guard", but a specific rule like "guard=Override" is also called "guard". I'd either call this a GuardTester or bypass the issue by making this a DenseMap whose values are Guards.

Altogether I might write this as:
```
using Guard = llvm::function_ref<bool(llvm::ArrayRef<const ForestNode *> RHS, const TokenStream &, const Grammar &)>;
...
const DenseMap<AttributeID, Guard> &Guards;
```

================
Comment at: clang-tools-extra/pseudo/include/clang-pseudo/GLR.h:129
+
+  ReduceGuard Guard; // Guard for reducing.
 };
----------------
either a reference, or a lightweight reference type like function_ref, for consistency with other fields?

================
Comment at: clang-tools-extra/pseudo/include/clang-pseudo/Grammar.h:22
 //
+//  Attributes have the syntax form of [guard=value;key=value], they are
+//  associated with a grammar symbol (on the right-hand side of the symbol) or
----------------
nit: first "guard" should be "key"?

================
Comment at: clang-tools-extra/pseudo/include/clang-pseudo/Grammar.h:87
+// Defines the built-in attribute keys.
+enum class AttributeKey : uint8_t {
+  // A guard controls whether a reduction of a rule will be conducted by the GLR
----------------
hokein wrote:
> new names are welcome. Attribute is the name I came up with (I think it is clearer than the original `Hook`), 
I don't understand why we need AttributeKey, and to parse things into it, rather than just using StringRef.

(Also here you've put it in the header, but it's not used in any interfaces)

================
Comment at: clang-tools-extra/pseudo/include/clang-pseudo/Grammar.h:94
+// It is the index into a table of attribute values.
+// NOTE: value among attributes must be unique even within different keys!
+using AttributeID = uint16_t;
----------------
hokein wrote:
> I'm not quite happy with using the value as the ID, I think we can encode the Key into the ID as well (ID := Key | Value). 
> 
> Similar to the generated enum name, currently we just use the name of Value (`Override`), it will be more confusing when we add more keys/values, one idea is to add key as well (`GuardOverride` etc?).
Packing all the possible values into a single unstructured enum is weird, but the problem (if it's a problem) is _redundancy_ and packing the attribute name in there as well just makes that worse.

If we want to fix this, I think a sounder fix is something like:
```
// Grammar.h
using AttributeEnum = uint8_t;
struct Rule { ...; AttributeEnum Guard = 0; }
// CXX.h
enum class GuardID : AttributeEnum { Override = 1; };
enum class RecoveryID : AttributeEnum { Parens = 1; };
```

i.e. we keep track of the unique values per attribute name and emit an enum for each attribute. This probably means emitting the actual enum definitions in tablegen, rather than a generic .inc file.

(But I'm also fine with not fixing it at this point, and keeping one flat enum for the values) 

================
Comment at: clang-tools-extra/pseudo/include/clang-pseudo/Grammar.h:95
+// NOTE: value among attributes must be unique even within different keys!
+using AttributeID = uint16_t;
+// A sentinel attributeID indicating no attributes, by default.
----------------
I agree HookID is an unfortunately vague name, but I did consider and reject AttributeID which I think is actively confusing.
In the plain-english meaning, "guard" is an attribute, and "override" is its value. So assigning "override" an "attribute ID" is very confusing.

The difficulty of naming this is that we can't refer to the semantics (e.g. "Guard") as we want a single namespace for all of them, and all they have in common is syntax. However referring to "attribute values" makes for confusing internal data structures, as we don't model the whole attribute generically.

So we're left with names that suggest "this is some kind of anchor to attach custom code to". Maybe `ExtensionID` is a little clearer than `HookID`?

================
Comment at: clang-tools-extra/pseudo/include/clang-pseudo/Grammar.h:97
+// A sentinel attributeID indicating no attributes, by default.
+static constexpr AttributeID NoneAttribute = 0;
+
----------------
If we're going to use implicit bool conversions, we should not pretend this is some arbitrary value and drop the constant I think.

================
Comment at: clang-tools-extra/pseudo/include/clang-pseudo/Grammar.h:110
+  Rule(SymbolID Target, llvm::ArrayRef<SymbolID> Seq,
+       AttributeID Guard = NoneAttribute);

----------------
if this optional and rarely set, I think it's should be a constructor parameter - this constructor could become unwieldy.

It also forces you to process the attributes before creating the Rule, and reading the code doing it afterwards seems more natural.

================
Comment at: clang-tools-extra/pseudo/include/clang-pseudo/cxx/CXX.h:39
 #define NONTERMINAL(X, Y) X = Y,
+#define ATTRIBUTE(X, Y)
 #include "CXXSymbols.inc"
----------------
this pattern is fragile. (and we're going to end up adding rules, too)

I'd suggest either:
 - generate a separate file for each enum
 - generate a file containing the full definiton of all the enums (we're not using the generality)
 - add the fallback #defines to the .inc file so the caller doesn't have to provide them all

================
Comment at: clang-tools-extra/pseudo/include/clang-pseudo/cxx/CXX.h:63
+// will die).
+bool guard(RuleID RID, llvm::ArrayRef<const ForestNode *> RHS,
+           const TokenStream &Tokens, const Grammar &G);
----------------
"guard" looks like a verb here. If we're going to use it as both noun and verb, the relationship between the two needs to match the metaphor (in this case, it needs to be the guard who is doing the guarding) and it's not clear that it is.

I think safest to stick to `const Guard &getGuard()` or so.

================
Comment at: clang-tools-extra/pseudo/lib/grammar/GrammarBNF.cpp:47

     // Assemble the name->ID and ID->nonterminal name maps.
     llvm::DenseSet<llvm::StringRef> UniqueNonterminals;
----------------
This patch takes this function from 80->120 lines, and it's starting to feel unwieldy.

I think moving building the UniqueNonterminals/SymbolIDs + the new maps into a function (which returns a struct of SymbolIds and AttrValueIDs) would make this more manageable.

================
Comment at: clang-tools-extra/pseudo/lib/grammar/GrammarBNF.cpp:51
+
+    llvm::DenseSet<llvm::StringRef> UniqueAttrKeys;
+    llvm::DenseSet<llvm::StringRef> UniqueAttrValues = {""};
----------------
unused

================
Comment at: clang-tools-extra/pseudo/lib/grammar/GrammarBNF.cpp:52
+    llvm::DenseSet<llvm::StringRef> UniqueAttrKeys;
+    llvm::DenseSet<llvm::StringRef> UniqueAttrValues = {""};
+    llvm::DenseMap<llvm::StringRef, AttributeID> AttrValueIDs;
----------------
why ""?

================
Comment at: clang-tools-extra/pseudo/lib/grammar/GrammarBNF.cpp:62
+    auto ConsiderAttrValue = [&](llvm::StringRef AttrValueName) {
+      if (!AttrValueIDs.count(AttrValueName))
+        UniqueAttrValues.insert(AttrValueName);
----------------
AttrValueIDs never seems to get populated, so I'm not sure this works.

================
Comment at: clang-tools-extra/pseudo/lib/grammar/GrammarBNF.cpp:119
+
+        for (const auto& AttrKeyAndValue : Elt.Attributes) {
+          switch(AttrKeyAndValue.first) {
----------------
This doesn't seem worth diagnosing to me - it seems stylistically weird but not really a problem to put a guard wherever.

================
Comment at: clang-tools-extra/pseudo/lib/grammar/GrammarBNF.cpp:119
+
+        for (const auto& AttrKeyAndValue : Elt.Attributes) {
+          switch(AttrKeyAndValue.first) {
----------------
sammccall wrote:
> This doesn't seem worth diagnosing to me - it seems stylistically weird but not really a problem to put a guard wherever.
maybe another function to pull out here?
applyAttribute(StringRef Name, StringRef Value, AttributeID ValueID, Rule&, Element&)

================
Comment at: clang-tools-extra/pseudo/lib/grammar/GrammarBNF.cpp:254
         continue; // skip empty
+      if (Chunk.startswith("[") && Chunk.endswith("]")) {
+        if (Out.Sequence.empty()) {
----------------
Again, I don't think we should be aiming to provide nice diagnostics for each possible way we could get this wrong - the grammar here is part of the pseudo-parser, not user input.

================
Comment at: clang-tools-extra/pseudo/lib/grammar/GrammarBNF.cpp:288
+
+      Diagnostics.push_back(
+          llvm::formatv("Unknown attribute key '{0}'", KeyAndValue.first)
----------------
(This seems like a condition we can diagnose when applying the attribute, instead of needing to do it eagerly here)

================
Comment at: clang-tools-extra/pseudo/unittests/GLRTest.cpp:161
       {GSSNode1, Action::reduce(ruleFor("enum-name"))}};
-  glrReduce(PendingReduce, {*G, Table, Arena, GSStack},
+  glrReduce(PendingReduce, TokenStream(), {*G, Table, Arena, GSStack},
             captureNewHeads());
----------------
These empty streams aren't valid (in multiple ways: they're not finalized, and they don't contain the tokens that the nodes refer to).

Maybe add a tokenStreamForTest(StringLiteral) function somewhere? Safe because guaranteed to be null-terminated and have sufficient lifetime...

Repository:
  rG LLVM Github Monorepo

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D126536/new/

https://reviews.llvm.org/D126536