[PATCH] D59887: [Syntax] Introduce TokenBuffer, start clangToolingSyntax library

Tue Apr 16 07:59:53 PDT 2019

sammccall added inline comments.

================
Comment at: clang/include/clang/Tooling/Syntax/TokenBuffer.h:146
+  llvm::Optional<std::pair<unsigned, unsigned>>
+  toOffsetRange(const Token *Begin, const Token *End,
+                const SourceManager &SM) const;
----------------
ilya-biryukov wrote:
> ilya-biryukov wrote:
> > sammccall wrote:
> > > I'd think this would map a character range onto a character range, or a token range onto a token range - is there some reason otherwise?
> > No particular reason, this was driven by its usage in function computing replacements (not in this patch, see [[ https://github.com/ilya-biryukov/llvm-project/blob/syntax/clang/lib/Tooling/Syntax/ComputeReplacements.cpp | github prototype ]], if interested, but keep in mind the prototype is not ready for review yet).
> > 
> > Mapping to a range of "raw" tokens seems more consistent with our model, will update accordingly, obtaining a range is easy after one has tokens from the file.
> Added a method to map from an expanded token range to a raw token range.
> Kept the method that maps an expanded token range to a character range too.
> 
> Note that we don't currently have a way to map from raw token range to expanded, that could be added later, we don't need it for the initial use-case (map the ranges coming from AST nodes into raw source ranges), so I left it out for now.
> Note that we don't currently have a way to map from raw token range to expanded, that could be added later, we don't need it for the initial use-case (map the ranges coming from AST nodes into raw source ranges), so I left it out for now.

That seems fine, can we add a comment somewhere (class header)?
Not exactly a FIXME, but it would clarify that this is in principle a bidirectional mapping, just missing some implementation.

================
Comment at: clang/include/clang/Tooling/Syntax/Tokens.h:48
+namespace syntax {
+class TokenBuffer;
+
----------------
not needed

================
Comment at: clang/include/clang/Tooling/Syntax/Tokens.h:59
+  explicit Token(const clang::Token &T);
+
+  tok::TokenKind kind() const { return Kind; }
----------------
Token should be copyable I think?

================
Comment at: clang/include/clang/Tooling/Syntax/Tokens.h:62
+  SourceLocation location() const { return Location; }
+  SourceLocation endLocation() const {
+    return Location.getLocWithOffset(Length);
----------------
sadly, this would need documentation - it's the one-past-the-end location, not the last character in the token, not the location of the "next" token, or the location of the "next" character...

I do wonder whether this is actually the right function to expose...
 Do you ever care about end but not start? (The reverse seems likelier). Having two independent source location accessors obscures the invariant that they have the same file ID.

I think exposing a `FileRange` accessor instead might be better, but for now you could also make callers use `Tok.location().getLocWithOffset(Tok.length())` until we know it's the right pattern to encourage

================
Comment at: clang/include/clang/Tooling/Syntax/Tokens.h:78
+  /// For debugging purposes.
+  std::string str() const;
+
----------------
(do we need to have two names for this version?)

================
Comment at: clang/include/clang/Tooling/Syntax/Tokens.h:95
+  /// End offset (exclusive) in a corresponding file.
+  unsigned End = 0;
+};
----------------
may want to offer `length()` and `StringRef text(const SourceManager&)`

================
Comment at: clang/include/clang/Tooling/Syntax/Tokens.h:111
+///       replacements,
+///    2. Raw tokens: corresponding directly to the source code of a file before
+///       any macro replacements occurred.
----------------
I think "Spelled" would be a better name than "Raw" here. (Despite being longer and grammatically awkward)

- The spelled/expanded dichotomy is well-established in clang-land, and this will help understand how they relate to each other and why the difference is important. It's true we'd be extending the meaning a bit (all PP activity, not just macros), but that applies to Expanded too. I think consistency is valuable here.
- "Raw" regarding tokens has an established meaning (raw lexer mode, raw_identifier etc) that AIUI you're not using here, but plausibly could be. So I think this may cause some confusion.

There's like 100 occurrences of "raw" in this patch, happy to discuss first to avoid churn.

================
Comment at: clang/include/clang/Tooling/Syntax/Tokens.h:133
+  /// point to one of these tokens.
+  /// FIXME: the notable exception is '>>' being split into two '>'. figure out
+  ///        how to deal with it.
----------------
It's not clear from this comment what the current behavior *is*, and how it's exceptional.
(It's a FIXME, but these sometimes last a while...)

================
Comment at: clang/include/clang/Tooling/Syntax/Tokens.h:133
+  /// point to one of these tokens.
+  /// FIXME: the notable exception is '>>' being split into two '>'. figure out
+  ///        how to deal with it.
----------------
sammccall wrote:
> It's not clear from this comment what the current behavior *is*, and how it's exceptional.
> (It's a FIXME, but these sometimes last a while...)
nit: unclear from comment whether this is the only exception or just the only notable one

================
Comment at: clang/include/clang/Tooling/Syntax/Tokens.h:140
+  /// Attempt to find the subrange of raw tokens that produced the corresponding
+  /// \p Expanded tokens. Will fail if the raw tokens cannot be determined
+  /// unambiguously. E.g. for the following example:
----------------
"cannot be determined unambiguously" is a little vague: "doesn't exactly correspond to a sequence of raw tokens"?

================
Comment at: clang/include/clang/Tooling/Syntax/Tokens.h:160
+  llvm::Optional<llvm::ArrayRef<syntax::Token>>
+  findRawByExpanded(llvm::ArrayRef<syntax::Token> Expanded) const;
+
----------------
Not totally clear what the legal arguments are here:
 - any sequence of tokens?
 - a subrange of expandedTokens() (must point into that array) - most obvious, but in this case why does mapping an empty range always fail?
 - any subrange of expandedTokens(), compared by value? - I think this is what the implementation assumes

================
Comment at: clang/include/clang/Tooling/Syntax/Tokens.h:160
+  llvm::Optional<llvm::ArrayRef<syntax::Token>>
+  findRawByExpanded(llvm::ArrayRef<syntax::Token> Expanded) const;
+
----------------
sammccall wrote:
> Not totally clear what the legal arguments are here:
>  - any sequence of tokens?
>  - a subrange of expandedTokens() (must point into that array) - most obvious, but in this case why does mapping an empty range always fail?
>  - any subrange of expandedTokens(), compared by value? - I think this is what the implementation assumes
> 
I think `by` is a little weak semantically, I think it just means "the parameter is expanded tokens".

`findRawForExpanded()` seems clearer.
`rawTokensForExpanded()` would be clearer still I think.

================
Comment at: clang/include/clang/Tooling/Syntax/Tokens.h:165
+  llvm::Optional<FileRange>
+  findOffsetsByExpanded(llvm::ArrayRef<syntax::Token> Expanded) const;
+
----------------
this still seems a bit non-orthogonal:
 - it fails if and only if findRawByExpanded fails, but that isn't visible to callers
 - there's no way to avoid the redundant work of doing the mapping twice
 - the name doesn't indicate that it maps to raw locations as well as translating tokens to locations
 - seems likely to lead to combinatorial explosion, e.g. if we want a pair of expansion locations to a file range, or expansion locations to raw token range...

Can this be a free function taking the *raw* token range instead, to be composed with `findRawByExpanded`? (Not sure if it belongs in this file or if it's a test helper)

================
Comment at: clang/lib/Tooling/Syntax/Tokens.cpp:120
+TokenBuffer::findRawByExpanded(llvm::ArrayRef<syntax::Token> Expanded) const {
+  // Mapping an empty range is not well-defined, bail out in that case.
+  if (Expanded.empty())
----------------
can you explain why it's not? it *almost* is

================
Comment at: clang/lib/Tooling/Syntax/Tokens.cpp:127
+  assert(FileIt != Files.end() && "no file for an expanded token");
+  // Crossing the file boundaries is not supported at the moment.
+  if (1 < Expanded.size() &&
----------------
This sounds like an implementation limitation, rather than a desired part of the contract.
maybe e.g.
`// If the tokens in the range don't come from the same file, raw token mapping isn't defined.
// Because files span contiguous token ranges, if the first/last token have the same file, so do the ones in between.`

(nit: no "the")

================
Comment at: clang/lib/Tooling/Syntax/Tokens.cpp:127
+  assert(FileIt != Files.end() && "no file for an expanded token");
+  // Crossing the file boundaries is not supported at the moment.
+  if (1 < Expanded.size() &&
----------------
sammccall wrote:
> This sounds like an implementation limitation, rather than a desired part of the contract.
> maybe e.g.
> `// If the tokens in the range don't come from the same file, raw token mapping isn't defined.
> // Because files span contiguous token ranges, if the first/last token have the same file, so do the ones in between.`
> 
> (nit: no "the")
The part that is an implementation limitation, as discussed offline:

```
#include "bar.h"
int foo;
```
expands to
```
int bar;
int foo;
```
but if you try to map those tokens back, to the file it'll fail due to file ID mismatch between first and last token.

Whereas the following will map backwards and forwards fine:
```
int foo1;
#include "bar.h";
int foo2;
```

So I think requiring the fileIDs of begin/end to match is a bug, and instead we should walk up the `#include` stack to the closest common file id (while requiring that the range cover the whole #include).

No need to implement this yet, but I think the case is worth documenting. 

================
Comment at: clang/lib/Tooling/Syntax/Tokens.cpp:134
+
+  unsigned BeginIndex = Expanded.begin() - ExpandedTokens.data();
+  unsigned EndIndex = Expanded.end() - ExpandedTokens.data();
----------------
ah, I missed the pointer calculation here. Add an assert at the top of the function that the arrayref is in-range?

================
Comment at: clang/lib/Tooling/Syntax/Tokens.cpp:149
+  }
+  // Find the last mapping that intersects with our range.
+  auto LastCall = llvm::lower_bound(File.Mappings, EndIndex,
----------------
eeee std::lower_bound...

as discussed offline, can reduce to one call with a helper?
Something like:
`pair<const Token*, const Mapping*> MarkedFile::rawForExpanded(const Token&)`

This would bsearch to find the relevant mapping.
(If the token isn't part of any mapping, you can find it by index arithmetic anchored on the next mapping/eof - so the second bsearch isn't necessary)

Then you can call this twice for the first/last token (gaining symmetry by converting to an open range). This yields a token range, and the returned mappings can be verified (whole mapping should be covered).

================
Comment at: clang/lib/Tooling/Syntax/Tokens.cpp:233
+///   expansions.
+class TokenCollector::Callbacks : public PPCallbacks {
+public:
----------------
can we add a FIXME for the things that aren't recorded?
 - includes
 - (other) preprocessor directives
 - pp skipped sections

================
Comment at: clang/unittests/Tooling/Syntax/TokensTest.cpp:1
+//===- TokensTest.cpp -----------------------------------------------------===//
+//
----------------
A few high level things discussed offline:

- can we more clearly separate out tests of the token collector (testing the **value** of the token buffer returned) vs those of the tokenbuffer (testing the tokenbuffer's behavior)
- the token collector tests might be more clearly/tersely expressed as an exact assertion on a string representation of the tokenbuffer (maybe a special-purpose one)

Repository:
  rG LLVM Github Monorepo

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D59887/new/

https://reviews.llvm.org/D59887