[PATCH] D128411: [syntax] Introduce a BaseToken class.

Sam McCall via Phabricator via cfe-commits cfe-commits at lists.llvm.org
Mon Jun 27 02:32:00 PDT 2022


sammccall added inline comments.


================
Comment at: clang/include/clang/Tooling/Syntax/TokenManager.h:9
+//
+// Defines Token interfaces for the syntax-tree, decoupling the syntax-tree from
+// the TokenBuffer. It enables producers (e.g. clang pseudoparser) to produce a
----------------
It's important that we have comments explaining what the concept is *now*, rather than how it changes the code structure from the previous state (decouples tokenbuffer, enables pseudoparser).

(I think this comment is actually fine, but be careful when writing the class comment for TokenManager)


================
Comment at: clang/include/clang/Tooling/Syntax/TokenManager.h:33
+  /// The syntax-tree Leaf node holds a Key.
+  using Key = const void *;
+  /// Gets the text of token identified by the key.
----------------
hokein wrote:
> ilya-biryukov wrote:
> > I have just realized that we were discussing having opaque index as a key, but there may also be tokens in the syntax tree that are not backed by the `TokenBuffer`.
> > Those can be synthesized, e.g. imagine someone wants to change `!(A && B)` to `!A || !B`. They will need to synthesize at least the `||` token as it's not in the source code. There is a way to do this now and it prohibits the use of an index to the `TokenBuffer`.
> > 
> > So having the opaque pointer is probably ok for now, it should enable the pseudo-parser to build syntax trees.
> > We might want to add an operation to synthesize tokens into the `TokenManager` at some point, but that's a discussion for another day.
> 
> > Those can be synthesized, e.g. imagine someone wants to change !(A && B) to !A || !B. They will need to synthesize at least the || token as it's not in the source code. There is a way to do this now and it prohibits the use of an index to the TokenBuffer.
> 
> Yes, this is the exact reason why the Key is an opaque pointer, my first attempt was to use an index integer, but failed -- we already have some APIs doing this stuff (see `createLeaf` in BuildTree.h), the token can be a synthesized token backed up by the SourceManager...
> 
> Personally, I don't like the Key to be an opaque pointer as well, but considering the effort, it seems to be the best approach so far -- it enables the pseudoparser to build syntax trees with a different Token implementation while keeping the rest syntax stuff unchanged.
> 
> > We might want to add an operation to synthesize tokens into the TokenManager at some point, but that's a discussion for another day.
> 
> Agree, we will encounter this in the future, but we're still far away from there (the layering mutation/syntax-tree is not perfect at the moment, mutation still depends on the TokenBuffer). And our initial application of syntax-tree in pseudoparser focuses on the read use-case, we should be fine now.
Consider `uintptr_t` instead, which more naturally supports both approaches. (Stashing an integer in a void* is incredibly weird)


================
Comment at: clang/include/clang/Tooling/Syntax/TokenManager.h:37
+
+  // FIXME: add an interface for getting token kind.
+};
----------------
I wouldn't want to prejudge this: this is a very basic attribute similar to kind/role, and we may want to store it in Leaf and avoid the virtual stuff.

There's certainly enough space, e.g. of the current 16-bit `kind` use the top bit to denote leaf-or-not and the bottom 15 bits to store kind-or-tokenkind.


================
Comment at: clang/include/clang/Tooling/Syntax/Tokens.h:463
+/// It tracks the underlying token buffers, source manager, etc.
+class SyntaxTokenManager : public TokenManager {
+public:
----------------
conceptually this is just "TokenBuffer implements TokenManager"

The main reason I can see not to actually write that is to avoid the dependency from TokenBuffer (tokens library) to TokenManager (syntax library). But here you've added that dependency anyway.

So I think we'd be better either with `TokenBuffer : TokenManager` or moving this class to its own header.


================
Comment at: clang/include/clang/Tooling/Syntax/Tokens.h:465
+public:
+  SyntaxTokenManager(SourceManager &SourceMgr, const LangOptions &LangOpts,
+                     const TokenBuffer &Tokens)
----------------
no need to take SourceManager, TokenBuffer already includes it.
LangOpts is unused here.


================
Comment at: clang/include/clang/Tooling/Syntax/Tokens.h:475
+    assert(Token);
+    // Handle 'eof' separately, calling text() on it produces an empty string.
+    if (Token->kind() == tok::eof)
----------------
Empty string seems like the correct return value here to me.
If you want a special case for dump, I think that belongs in dump().

If this is because we currently provide no way to get the token kind, then this should be a FIXME


================
Comment at: clang/include/clang/Tooling/Syntax/Tokens.h:487
+  const SourceManager &getSourceManager() const { return SM; }
+  const TokenBuffer &getTokenBuffer() const { return Tokens; }
+
----------------
just tokenBuffer(), consistent with TokenBuffer::sourceManager()


================
Comment at: clang/include/clang/Tooling/Syntax/Tokens.h:490
+private:
+  // This mangaer is powered by the TokenBuffer.
+  static constexpr llvm::StringLiteral Kind = "TokenBuffer";
----------------
manager


================
Comment at: clang/include/clang/Tooling/Syntax/Tokens.h:504
+  /// IDs and storage for additional tokenized files.
+  llvm::DenseMap<FileID, std::vector<Token>> ExtraTokens;
+};
----------------
aren't we trying to store these on the syntax arena?
We never do lookups into this map, maybe lexbuffer just allocates storage on the arena('s allocator) instead of using this map?


================
Comment at: clang/include/clang/Tooling/Syntax/Tree.h:156
 
-/// A leaf node points to a single token inside the expanded token stream.
 class Leaf final : public Node {
----------------
why is this comment removed?


================
Comment at: clang/lib/Tooling/Syntax/Tree.cpp:271
+      // FIXME: can be fixed by adding an tok::Kind in the Leaf node.
+      // assert(cast<Leaf>(C).getToken()->kind() == L->getDelimiterTokenKind());
     }
----------------
ilya-biryukov wrote:
> hokein wrote:
> > ilya-biryukov wrote:
> > > Maybe add `TokenManager::getKind(Key)` right away and remove this FIXME.
> > > This should as simple as `cast<syntax::Token>(T)->Kind`, right? Or am I missing some complications?
> > Yeah, the main problem is that we don't have the `TokenManager` object in the `syntax::Node`, we somehow need to pass it (e.g. a function parameter), which seems a heavy change. I'm not sure it is worth for this small assert.
> That makes sense. WDYT about the alternative fix: to pass ̀TokenManager` to `assertInvariants`?
> Not necessary to do it now, just thinking about changing the FIXME
per my comment above: Leaf can store the tok::Kind directly and I think it's appropriate to do so.
But maybe fiddly enough that it's worth deferring for one patch


Repository:
  rG LLVM Github Monorepo

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D128411/new/

https://reviews.llvm.org/D128411



More information about the cfe-commits mailing list