[PATCH] D59887: [Syntax] Introduce TokenBuffer, start clangToolingSyntax library

Mon Apr 29 00:55:28 PDT 2019

sammccall accepted this revision.
sammccall marked an inline comment as done.
sammccall added a comment.
This revision is now accepted and ready to land.

Rest is details only.

================
Comment at: clang/include/clang/Tooling/Syntax/Tokens.h:59
+  explicit Token(const clang::Token &T);
+
+  tok::TokenKind kind() const { return Kind; }
----------------
ilya-biryukov wrote:
> sammccall wrote:
> > Token should be copyable I think?
> Sure, they are copyable. Am I missing something?
No. I think I got confused by the explicit clang::Token constructor.

================
Comment at: clang/include/clang/Tooling/Syntax/Tokens.h:62
+  SourceLocation location() const { return Location; }
+  SourceLocation endLocation() const {
+    return Location.getLocWithOffset(Length);
----------------
ilya-biryukov wrote:
> sammccall wrote:
> > sadly, this would need documentation - it's the one-past-the-end location, not the last character in the token, not the location of the "next" token, or the location of the "next" character...
> > 
> > I do wonder whether this is actually the right function to expose...
> >  Do you ever care about end but not start? (The reverse seems likelier). Having two independent source location accessors obscures the invariant that they have the same file ID.
> > 
> > I think exposing a `FileRange` accessor instead might be better, but for now you could also make callers use `Tok.location().getLocWithOffset(Tok.length())` until we know it's the right pattern to encourage
> Added a comment. With the comment and an implementation being inline and trivial, I don't think anyone would have trouble understanding what this method does.
(this is ok. I do think a FileRange accessor would make client code more readable/less error-prone. Let's discuss offline)

================
Comment at: clang/include/clang/Tooling/Syntax/Tokens.h:78
+  /// For debugging purposes.
+  std::string str() const;
+
----------------
ilya-biryukov wrote:
> sammccall wrote:
> > (do we need to have two names for this version?)
> You mean to have distinct names for two different overloads?
> I wouldn't do it, they both have fairly similar outputs, could add a small comment that the one with SourceManager should probably be preferred if you have one.
> 
No sorry, I meant do we need both str() and operator<<

================
Comment at: clang/include/clang/Tooling/Syntax/Tokens.h:95
+  /// End offset (exclusive) in a corresponding file.
+  unsigned End = 0;
+};
----------------
ilya-biryukov wrote:
> sammccall wrote:
> > may want to offer `length()` and `StringRef text(const SourceManager&)`
> SG.
> WDYT of exposing the struct members via getters too?
> 
> This would mean uniform access for all things (beginOffset, endOffset, length) and adding a ctor that checks invariants (Begin <= End, File.isValid()) would also fit naturally.
> 
that sounds fine to me, though I don't feel strongly

================
Comment at: clang/include/clang/Tooling/Syntax/Tokens.h:75
+
+  /// For debugging purposes. More verbose than the other overload, but requries
+  /// a source manager.
----------------
there is no "other overload"

================
Comment at: clang/lib/Tooling/Syntax/Tokens.cpp:77
+
+std::string TokenBuffer::dumpForTests() const {
+  auto PrintToken = [this](const syntax::Token &T) -> std::string {
----------------
Nit: I'd suggest moving this all the way to the bottom or something? It's pretty big and seems a bit "in the way" here.

================
Comment at: clang/lib/Tooling/Syntax/Tokens.cpp:130
+      OS << llvm::formatv(
+          "    ['{0}'_{1}, '{2}'_{3}) => ['{4}'_{5}, '{6}'_{7})\n",
+          PrintToken(File.SpelledTokens[M.BeginSpelled]), M.BeginSpelled,
----------------
As predicted :-) I think these `_<index>` suffixes are a maintenance hazard.

In practice, the assertions are likely to add them by copy-pasting the test output.

They guard against a class of error that doesn't seem very likely, and in practice they don't even really prevent it (because of the copy/paste issue).

I'd suggest just dropping them, and living with the test assertions being slightly ambiguous.

Failing that, some slightly trickier formatting could give more context:

`A B C D E F` --> `A B  ... E F`

With special cases for smaller numbers of tokens. I don't like the irregularity of that approach though.

================
Comment at: clang/lib/Tooling/Syntax/Tokens.cpp:156
+  // Find the first mapping that produced tokens after \p Expanded.
+  auto It = llvm::upper_bound(
+      File.Mappings, ExpandedIndex,
----------------
if you prefer, this is

```
auto It = llvm::bsearch(File.Mappings,
                        [&](const Mapping &M) { return M.BeginExpanded >= Expanded; });
```
I *think* this is clear enough that you can drop the comment, though I may be biased.

================
Comment at: clang/lib/Tooling/Syntax/Tokens.cpp:94
+TokenBuffer::expandedToRaw(const syntax::Token *Expanded) const {
+  assert(Expanded);
+  assert(ExpandedTokens.data() <= Expanded &&
----------------
why not just take a ref?

================
Comment at: clang/lib/Tooling/Syntax/Tokens.cpp:100
+      SourceMgr->getFileID(SourceMgr->getExpansionLoc(Expanded->location())));
+  assert(FileIt != Files.end() && "no file for an expanded token");
+
----------------
you might want to move the token -> fileID calculation to a helper function (or method on token) called by `findRawByExpanded`, and then put this function onto `MarkedFile`.

Reason is, computing the common #include ancestor (and therefore the file buffer to use) can't live in this function, which only sees one of the endpoints.

But this can be deferred until that case is handled...

================
Comment at: clang/lib/Tooling/Syntax/Tokens.cpp:108
+      File.Mappings, ExpandedIndex,
+      [](unsigned L, const Mapping &R) { return L < R.BeginExpandedToken; });
+  // Our token could only be produced by the previous mapping.
----------------
(renaming ExpandedIndex -> L seems confusing, use same name or just capture?)

================
Comment at: clang/lib/Tooling/Syntax/Tokens.cpp:171
+  FileID FID = SourceMgr->getFileID(BeginRawToken->location());
+  // FIXME: Handle multi-file changes by trying to map onto a common root.
+  if (FID != SourceMgr->getFileID(LastRawToken->location()))
----------------
nit: "include root" or "include ancestor"?

================
Comment at: clang/unittests/Tooling/Syntax/TokensTest.cpp:622
+
+// FIXME: add tests for mapping spelled tokens into offsets.
+
----------------
please fix :-)

Repository:
  rG LLVM Github Monorepo

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D59887/new/

https://reviews.llvm.org/D59887