[PATCH] D119172: [pseudo] Implement LRGraph
Sam McCall via Phabricator via cfe-commits
cfe-commits at lists.llvm.org
Tue Feb 8 02:01:07 PST 2022
sammccall accepted this revision.
sammccall added a comment.
This revision is now accepted and ready to land.
LG, thank you!
Bunch of nits, up to you what you want to do with them.
================
Comment at: clang/include/clang/Tooling/Syntax/Pseudo/LRGraph.h:10
+// LR parsers are bottom-up parsers -- they scan the input from left to right,
+// and find the right-hand side of a production rule (called handle), and
+// replace the handle with the nonterminal defined by the production rule.
----------------
I think it'd be useful to be a little more concrete here than "find"...
```
and collect the right-hand side of a production rule (called handle) on top
of the stack, then replace (reduce) the handle with the nonterminal defined
by the production rule.
```
================
Comment at: clang/include/clang/Tooling/Syntax/Pseudo/LRGraph.h:42
+// An LR item -- a grammar rule with a dot at some position of the body.
+// e.g. a production rule A := xy yields 3 items:
+// A := .xy
----------------
The lack of spaces on the RHS is inconsistent with the debug output, and any examples with real grammars (since we need spaces to split up words).
Can we use `A := x y` instead?
Or possibly `A := X Y` which makes the boundaries more obvious when it appears in-line with text.
(I don't have an opinion about `A := X . Y` vs `A := X.Y` (dot *replaces* space), but again we should pick one consistent with the debug output.
================
Comment at: clang/include/clang/Tooling/Syntax/Pseudo/LRGraph.h:52
+public:
+ Item(RuleID ID, uint8_t DotPos, uint8_t RuleLength)
+ : RID(ID), DotPos(DotPos), RuleLength(RuleLength) {}
----------------
I think the calls to these constructors could be a little clearer as factories that describe their purpose:
e.g.
```
static Item start(RuleID ID, const Grammar &G);
static Item sentinel(uint8_t ID);
Item Item::advance() const;
```
I think that would cover everything?
================
Comment at: clang/include/clang/Tooling/Syntax/Pseudo/LRGraph.h:85
+};
+using ItemSet = std::vector<Item>;
+
----------------
this alias is used only once in the header, for Items - it doesn't really introduce a new public concept that's distinct from that field.
I think it would be clearer to write it as std::vector<Item> there, and move the alias to the implementation file.
================
Comment at: clang/include/clang/Tooling/Syntax/Pseudo/LRGraph.h:101
+ std::string dump(const Grammar &G) const;
+ struct LessItem {
+ LessItem(const Grammar &G) : G(G) {}
----------------
call this "SortByNextSymbol"? or something?
As it is it's hard at a glance to understand why we have operator< and LessItem
================
Comment at: clang/include/clang/Tooling/Syntax/Pseudo/LRGraph.h:101
+ std::string dump(const Grammar &G) const;
+ struct LessItem {
+ LessItem(const Grammar &G) : G(G) {}
----------------
sammccall wrote:
> call this "SortByNextSymbol"? or something?
>
> As it is it's hard at a glance to understand why we have operator< and LessItem
This comparator is pure implementation detail, and can go in the cpp file.
(Still fine to reference in the comment, but I think "in a canonical order" is enough)
================
Comment at: clang/include/clang/Tooling/Syntax/Pseudo/LRGraph.h:110
+ // Item with a trailing dot is considered minimal.
+ return L.hasNext() < R.hasNext();
+ }
----------------
this seems to compare items equal if both are dot-at-the-end.
In general the cascade/priority/completeness of comparisons isn't really easy to spot. Consider structuring like:
```
if (L.hasNext() && R.hasNext() && R.Lnext(G) != R.next(G))
return L.next(G) < R.next(G);
if (L.hasNext() != R.hasNext())
return L.hasNext() < R.hasNext(); // trailing dot is minimum
return L < R;
```
================
Comment at: clang/include/clang/Tooling/Syntax/Pseudo/LRGraph.h:131
+
+ // Constructs an LR(0) automaton.
+ static LRGraph buildLR0(const Grammar &);
----------------
IIUC the data structure used here is inherently LR0, rather than the choice of how we build it.
(Vs in the LRTable where we could come up with data representing a full LR(1) parser without changing the structure).
So I think this function doesn't need LR0 in its name (but the class could be LR0Graph if you like)?
================
Comment at: clang/lib/Tooling/Syntax/Pseudo/LRGraph.cpp:36
+ llvm::DenseSet<Item> Closure = {IS.begin(), IS.end()};
+ while (!IS.empty()) {
+ auto ProcessingItem = std::move(IS.back());
----------------
This took me a while to understand: you're reusing the ItemSet (passed by value) as the work queue (well, stack) and it's already initialized to the right elements.
This is clever, but please add a comment. Maybe even rename IS->Queue
================
Comment at: clang/lib/Tooling/Syntax/Pseudo/LRGraph.cpp:37
+ while (!IS.empty()) {
+ auto ProcessingItem = std::move(IS.back());
+ IS.pop_back();
----------------
nit: auto -> Item?
(Generally if the types aren't hard to write or read, it seems nice to include them)
================
Comment at: clang/lib/Tooling/Syntax/Pseudo/LRGraph.cpp:44
+
+ if (pseudo::isToken(NextSym))
+ continue;
----------------
nit: I find the spacing (blank lines) here a bit confusing: this seems part of the previous "paragraph" rather than the following one.
Could use a comment like "Is there a nonterminal next() to expand?"
================
Comment at: clang/lib/Tooling/Syntax/Pseudo/LRGraph.cpp:53
+ }
+ ItemSet Sorted = {Closure.begin(), Closure.end()};
+ llvm::sort(Sorted, State::LessItem(G));
----------------
You still have the `IS` array sitting there, you could reuse that again and avoid the allocation if you didn't expand the set too much :-)
================
Comment at: clang/lib/Tooling/Syntax/Pseudo/LRGraph.cpp:84
+ for (const Item &I : Batch)
+ Next.push_back(Item(I.rule(), I.dot() + 1, G));
+ // sort the set to keep order determinism for hash computation.
----------------
Maybe call this Item.advance(), to avoid exposing the constructor
================
Comment at: clang/lib/Tooling/Syntax/Pseudo/LRGraph.cpp:102
+ };
+ return llvm::formatv("{0} := {1} • {2}\n", G.symbolName(Rule.Target),
+ llvm::join(ToNames(Rule.seq().take_front(DotPos)), " "),
----------------
Since this is a single-line output, it seems more flexible to put the caller in charge of adding a newline if they want one?
(This would be inconsistent with State::dump of course, but I think that's OK. Maybe we should work out a naming convention at some point)
================
Comment at: clang/lib/Tooling/Syntax/Pseudo/LRGraph.cpp:112
+ for (const auto &Item : Items)
+ OS << llvm::formatv(" {0}", Item.dump(G));
+ return OS.str();
----------------
Hardcoding the indentation seems a bit special-purpose. Make the indent level an optional parameter to State::dump() and use OS.indent()?
================
Comment at: clang/lib/Tooling/Syntax/Pseudo/LRGraph.cpp:126
+ for (const auto &E : Edges) {
+ OS << llvm::formatv("{0}->[{1}]{2}\n", E.Src, G.symbolName(E.Label), E.Dst);
+ }
----------------
nit: more spaces would make this more readable
`{0} ->[{1}] {2}`
or even
`{0} --[{1}]-> {2}`?
================
Comment at: clang/lib/Tooling/Syntax/Pseudo/LRGraph.cpp:168
+ private:
+ // We store the hash codes, as collisions is rare and ItemSet is heavy.
+ // key is the hash value of **kernel** item sets.
----------------
This really feels like a dubious optimization to me as it's *incorrect* if we ever get a hash collision, and the kernel ItemSet is much smaller than `States` which we're already storing. This seems like it must only save 20% or something of overall size?
In the other review thread you compared the size of DenseMap<hash_code, size_t> vs DenseMap<ItemSet, size_t>, but really the total size including `States` seems like a more useful comparison.
Repository:
rG LLVM Github Monorepo
CHANGES SINCE LAST ACTION
https://reviews.llvm.org/D119172/new/
https://reviews.llvm.org/D119172
More information about the cfe-commits
mailing list