[PATCH] D59887: [Syntax] Introduce TokenBuffer, start clangToolingSyntax library

Wed Apr 24 08:39:18 PDT 2019

ilya-biryukov added inline comments.

================
Comment at: clang/include/clang/Tooling/Syntax/Tokens.h:59
+  explicit Token(const clang::Token &T);
+
+  tok::TokenKind kind() const { return Kind; }
----------------
sammccall wrote:
> Token should be copyable I think?
Sure, they are copyable. Am I missing something?

================
Comment at: clang/include/clang/Tooling/Syntax/Tokens.h:62
+  SourceLocation location() const { return Location; }
+  SourceLocation endLocation() const {
+    return Location.getLocWithOffset(Length);
----------------
sammccall wrote:
> sadly, this would need documentation - it's the one-past-the-end location, not the last character in the token, not the location of the "next" token, or the location of the "next" character...
> 
> I do wonder whether this is actually the right function to expose...
>  Do you ever care about end but not start? (The reverse seems likelier). Having two independent source location accessors obscures the invariant that they have the same file ID.
> 
> I think exposing a `FileRange` accessor instead might be better, but for now you could also make callers use `Tok.location().getLocWithOffset(Tok.length())` until we know it's the right pattern to encourage
Added a comment. With the comment and an implementation being inline and trivial, I don't think anyone would have trouble understanding what this method does.

================
Comment at: clang/include/clang/Tooling/Syntax/Tokens.h:78
+  /// For debugging purposes.
+  std::string str() const;
+
----------------
sammccall wrote:
> (do we need to have two names for this version?)
You mean to have distinct names for two different overloads?
I wouldn't do it, they both have fairly similar outputs, could add a small comment that the one with SourceManager should probably be preferred if you have one.

================
Comment at: clang/include/clang/Tooling/Syntax/Tokens.h:95
+  /// End offset (exclusive) in a corresponding file.
+  unsigned End = 0;
+};
----------------
sammccall wrote:
> may want to offer `length()` and `StringRef text(const SourceManager&)`
SG.
WDYT of exposing the struct members via getters too?

This would mean uniform access for all things (beginOffset, endOffset, length) and adding a ctor that checks invariants (Begin <= End, File.isValid()) would also fit naturally.

================
Comment at: clang/include/clang/Tooling/Syntax/Tokens.h:111
+///       replacements,
+///    2. Raw tokens: corresponding directly to the source code of a file before
+///       any macro replacements occurred.
----------------
sammccall wrote:
> I think "Spelled" would be a better name than "Raw" here. (Despite being longer and grammatically awkward)
> 
> - The spelled/expanded dichotomy is well-established in clang-land, and this will help understand how they relate to each other and why the difference is important. It's true we'd be extending the meaning a bit (all PP activity, not just macros), but that applies to Expanded too. I think consistency is valuable here.
> - "Raw" regarding tokens has an established meaning (raw lexer mode, raw_identifier etc) that AIUI you're not using here, but plausibly could be. So I think this may cause some confusion.
> 
> There's like 100 occurrences of "raw" in this patch, happy to discuss first to avoid churn.
Spelled looks fine. The only downside is that it intersects with SourceManager's definition of spelled (which is very similar, but not the same).
Still like it more than raw.

================
Comment at: clang/include/clang/Tooling/Syntax/Tokens.h:160
+  llvm::Optional<llvm::ArrayRef<syntax::Token>>
+  findRawByExpanded(llvm::ArrayRef<syntax::Token> Expanded) const;
+
----------------
sammccall wrote:
> sammccall wrote:
> > Not totally clear what the legal arguments are here:
> >  - any sequence of tokens?
> >  - a subrange of expandedTokens() (must point into that array) - most obvious, but in this case why does mapping an empty range always fail?
> >  - any subrange of expandedTokens(), compared by value? - I think this is what the implementation assumes
> > 
> I think `by` is a little weak semantically, I think it just means "the parameter is expanded tokens".
> 
> `findRawForExpanded()` seems clearer.
> `rawTokensForExpanded()` would be clearer still I think.
Yes, `Expanded` should be a subrange of `expandedTokens()`. Added a comment.

================
Comment at: clang/include/clang/Tooling/Syntax/Tokens.h:165
+  llvm::Optional<FileRange>
+  findOffsetsByExpanded(llvm::ArrayRef<syntax::Token> Expanded) const;
+
----------------
sammccall wrote:
> this still seems a bit non-orthogonal:
>  - it fails if and only if findRawByExpanded fails, but that isn't visible to callers
>  - there's no way to avoid the redundant work of doing the mapping twice
>  - the name doesn't indicate that it maps to raw locations as well as translating tokens to locations
>  - seems likely to lead to combinatorial explosion, e.g. if we want a pair of expansion locations to a file range, or expansion locations to raw token range...
> 
> Can this be a free function taking the *raw* token range instead, to be composed with `findRawByExpanded`? (Not sure if it belongs in this file or if it's a test helper)
I've changed this to accept a spelled token sequence, but kept it a member function.
This allows to add an assertion that raw tokens come from a corresponding raw token stream.

================
Comment at: clang/lib/Tooling/Syntax/Tokens.cpp:127
+  assert(FileIt != Files.end() && "no file for an expanded token");
+  // Crossing the file boundaries is not supported at the moment.
+  if (1 < Expanded.size() &&
----------------
sammccall wrote:
> sammccall wrote:
> > This sounds like an implementation limitation, rather than a desired part of the contract.
> > maybe e.g.
> > `// If the tokens in the range don't come from the same file, raw token mapping isn't defined.
> > // Because files span contiguous token ranges, if the first/last token have the same file, so do the ones in between.`
> > 
> > (nit: no "the")
> The part that is an implementation limitation, as discussed offline:
> 
> ```
> #include "bar.h"
> int foo;
> ```
> expands to
> ```
> int bar;
> int foo;
> ```
> but if you try to map those tokens back, to the file it'll fail due to file ID mismatch between first and last token.
> 
> Whereas the following will map backwards and forwards fine:
> ```
> int foo1;
> #include "bar.h";
> int foo2;
> ```
> 
> So I think requiring the fileIDs of begin/end to match is a bug, and instead we should walk up the `#include` stack to the closest common file id (while requiring that the range cover the whole #include).
> 
> No need to implement this yet, but I think the case is worth documenting. 
Done. I've added a fixme, it's not very detailed though (does not have an example), let me know if you feel it's necessary to add one.

================
Comment at: clang/lib/Tooling/Syntax/Tokens.cpp:134
+
+  unsigned BeginIndex = Expanded.begin() - ExpandedTokens.data();
+  unsigned EndIndex = Expanded.end() - ExpandedTokens.data();
----------------
sammccall wrote:
> ah, I missed the pointer calculation here. Add an assert at the top of the function that the arrayref is in-range?
We now have the corresponding assertions in a helper function.

Repository:
  rG LLVM Github Monorepo

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D59887/new/

https://reviews.llvm.org/D59887