[PATCH] D119172: [pseudo] Implement LRGraph

Tue Feb 8 02:01:07 PST 2022

sammccall accepted this revision.
sammccall added a comment.
This revision is now accepted and ready to land.

LG, thank you!
Bunch of nits, up to you what you want to do with them.

================
Comment at: clang/include/clang/Tooling/Syntax/Pseudo/LRGraph.h:10
+//  LR parsers are bottom-up parsers -- they scan the input from left to right,
+//  and find the right-hand side of a production rule (called handle), and
+//  replace the handle with the nonterminal defined by the production rule.
----------------
I think it'd be useful to be a little more concrete here than "find"...

```
and collect the right-hand side of a production rule (called handle) on top
of the stack, then replace (reduce) the handle with the nonterminal defined
by the production rule.
```

================
Comment at: clang/include/clang/Tooling/Syntax/Pseudo/LRGraph.h:42
+// An LR item -- a grammar rule with a dot at some position of the body.
+// e.g. a production rule A := xy yields 3 items:
+//   A := .xy
----------------
The lack of spaces on the RHS is inconsistent with the debug output, and any examples with real grammars (since we need spaces to split up words).

Can we use `A := x y` instead?
Or possibly `A := X Y` which makes the boundaries more obvious when it appears in-line with text.

(I don't have an opinion about `A := X . Y` vs `A := X.Y` (dot *replaces* space), but again we should pick one consistent with the debug output.

================
Comment at: clang/include/clang/Tooling/Syntax/Pseudo/LRGraph.h:52
+public:
+  Item(RuleID ID, uint8_t DotPos, uint8_t RuleLength)
+      : RID(ID), DotPos(DotPos), RuleLength(RuleLength) {}
----------------
I think the calls to these constructors could be a little clearer as factories that describe their purpose:

e.g.

```
static Item start(RuleID ID, const Grammar &G);
static Item sentinel(uint8_t ID);
Item Item::advance() const;
```

I think that would cover everything?

================
Comment at: clang/include/clang/Tooling/Syntax/Pseudo/LRGraph.h:85
+};
+using ItemSet = std::vector<Item>;
+
----------------
this alias is used only once in the header, for Items - it doesn't really introduce a new public concept that's distinct from that field.

I think it would be clearer to write it as std::vector<Item> there, and move the alias to the implementation file.

================
Comment at: clang/include/clang/Tooling/Syntax/Pseudo/LRGraph.h:101
+  std::string dump(const Grammar &G) const;
+  struct LessItem {
+    LessItem(const Grammar &G) : G(G) {}
----------------
call this "SortByNextSymbol"? or something?

As it is it's hard at a glance to understand why we have operator< and LessItem

================
Comment at: clang/include/clang/Tooling/Syntax/Pseudo/LRGraph.h:101
+  std::string dump(const Grammar &G) const;
+  struct LessItem {
+    LessItem(const Grammar &G) : G(G) {}
----------------
sammccall wrote:
> call this "SortByNextSymbol"? or something?
> 
> As it is it's hard at a glance to understand why we have operator< and LessItem
This comparator is pure implementation detail, and can go in the cpp file.
(Still fine to reference in the comment, but I think "in a canonical order" is enough)

================
Comment at: clang/include/clang/Tooling/Syntax/Pseudo/LRGraph.h:110
+      // Item with a trailing dot is considered minimal.
+      return L.hasNext() < R.hasNext();
+    }
----------------
this seems to compare items equal if both are dot-at-the-end.

In general the cascade/priority/completeness of comparisons isn't really easy to spot. Consider structuring like:

```
if (L.hasNext() && R.hasNext() && R.Lnext(G) != R.next(G))
  return L.next(G) < R.next(G);
if (L.hasNext() != R.hasNext())
  return L.hasNext() < R.hasNext(); // trailing dot is minimum
return L < R;
```

================
Comment at: clang/include/clang/Tooling/Syntax/Pseudo/LRGraph.h:131
+
+  // Constructs an LR(0) automaton.
+  static LRGraph buildLR0(const Grammar &);
----------------
IIUC the data structure used here is inherently LR0, rather than the choice of how we build it.
(Vs in the LRTable where we could come up with data representing a full LR(1) parser without changing the structure).

So I think this function doesn't need LR0 in its name (but the class could be LR0Graph if you like)?

================
Comment at: clang/lib/Tooling/Syntax/Pseudo/LRGraph.cpp:36
+  llvm::DenseSet<Item> Closure = {IS.begin(), IS.end()};
+  while (!IS.empty()) {
+    auto ProcessingItem = std::move(IS.back());
----------------
This took me a while to understand: you're reusing the ItemSet (passed by value) as the work queue (well, stack) and it's already initialized to the right elements.

This is clever, but please add a comment. Maybe even rename IS->Queue

================
Comment at: clang/lib/Tooling/Syntax/Pseudo/LRGraph.cpp:37
+  while (!IS.empty()) {
+    auto ProcessingItem = std::move(IS.back());
+    IS.pop_back();
----------------
nit: auto -> Item?
(Generally if the types aren't hard to write or read, it seems nice to include them)

================
Comment at: clang/lib/Tooling/Syntax/Pseudo/LRGraph.cpp:44
+
+    if (pseudo::isToken(NextSym))
+      continue;
----------------
nit: I find the spacing (blank lines) here a bit confusing: this seems part of the previous "paragraph" rather than the following one.

Could use a comment like "Is there a nonterminal next() to expand?"

================
Comment at: clang/lib/Tooling/Syntax/Pseudo/LRGraph.cpp:53
+  }
+  ItemSet Sorted = {Closure.begin(), Closure.end()};
+  llvm::sort(Sorted, State::LessItem(G));
----------------
You still have the `IS` array sitting there, you could reuse that again and avoid the allocation if you didn't expand the set too much :-)

================
Comment at: clang/lib/Tooling/Syntax/Pseudo/LRGraph.cpp:84
+    for (const Item &I : Batch)
+      Next.push_back(Item(I.rule(), I.dot() + 1, G));
+    // sort the set to keep order determinism for hash computation.
----------------
Maybe call this Item.advance(), to avoid exposing the constructor

================
Comment at: clang/lib/Tooling/Syntax/Pseudo/LRGraph.cpp:102
+  };
+  return llvm::formatv("{0} := {1} • {2}\n", G.symbolName(Rule.Target),
+                       llvm::join(ToNames(Rule.seq().take_front(DotPos)), " "),
----------------
Since this is a single-line output, it seems more flexible to put the caller in charge of adding a newline if they want one?

(This would be inconsistent with State::dump of course, but I think that's OK. Maybe we should work out a naming convention at some point)

================
Comment at: clang/lib/Tooling/Syntax/Pseudo/LRGraph.cpp:112
+  for (const auto &Item : Items)
+    OS << llvm::formatv("    {0}", Item.dump(G));
+  return OS.str();
----------------
Hardcoding the indentation seems a bit special-purpose. Make the indent level an optional parameter to State::dump() and use OS.indent()?

================
Comment at: clang/lib/Tooling/Syntax/Pseudo/LRGraph.cpp:126
+  for (const auto &E : Edges) {
+    OS << llvm::formatv("{0}->[{1}]{2}\n", E.Src, G.symbolName(E.Label), E.Dst);
+  }
----------------
nit: more spaces would make this more readable

`{0} ->[{1}] {2}`
or even
`{0} --[{1}]-> {2}`?

================
Comment at: clang/lib/Tooling/Syntax/Pseudo/LRGraph.cpp:168
+  private:
+    // We store the hash codes, as collisions is rare and ItemSet is heavy.
+    // key is the hash value of **kernel** item sets.
----------------
This really feels like a dubious optimization to me as it's *incorrect* if we ever get a hash collision, and the kernel ItemSet is much smaller than `States` which we're already storing. This seems like it must only save 20% or something of overall size?

In the other review thread you compared the size of DenseMap<hash_code, size_t> vs DenseMap<ItemSet, size_t>, but really the total size including `States` seems like a more useful comparison.

Repository:
  rG LLVM Github Monorepo

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D119172/new/

https://reviews.llvm.org/D119172