[PATCH] D59887: [Syntax] Introduce TokenBuffer, start clangToolingSyntax library

Sam McCall via Phabricator via cfe-commits cfe-commits at lists.llvm.org
Mon Apr 1 09:36:53 PDT 2019


sammccall added a comment.

incomplete, haven't reviewed token collector



================
Comment at: clang/include/clang/Tooling/Syntax/TokenBuffer.h:72
+  /// macro or a name, arguments and parentheses of a function-like macro.
+  llvm::ArrayRef<syntax::Token> macroTokens(const TokenBuffer &B) const;
+  /// The range covering macroTokens().
----------------
ilya-biryukov wrote:
> gribozavr wrote:
> > `invocationTokens` or `macroInvocationTokens`
> The plan is to eventually include the macro directives tokens, hence the name.
> `invocationTokens` are somewhat inaccurate in that case, can't come up with a better name.
I think your reply applies to TokenBuffer::macroTokens(), not MacroInvocation::macroTokens().

+1 to invocationTokens here.


================
Comment at: clang/include/clang/Tooling/Syntax/TokenBuffer.h:74
+  /// The range covering macroTokens().
+  std::pair<unsigned, unsigned> macroRange(const TokenBuffer &B,
+                                           const SourceManager &SM) const;
----------------
ilya-biryukov wrote:
> gribozavr wrote:
> > `invocationSourceRange` or `macroInvocationSourceRange` depending on what you choose for the function above.
> > 
> WDYT about `range`?
> Given the name of the parent class, this should be unambiguous.
I'd personally prefer invocationRange for symmetry with invocationTokens (which probably can't just be called tokens). But range is OK.
Please document half-openness (shouldn't be necessary, but this is clang after all).




================
Comment at: clang/include/clang/Tooling/Syntax/TokenBuffer.h:1
+//===- TokenBuffer.h - store tokens of preprocessed files -----*- C++ -*-=====//
+//
----------------
are you sure TokenBuffer is the central concept in this file, rather than just the thing with the most code? Token.h might end up being a better name for users.


================
Comment at: clang/include/clang/Tooling/Syntax/TokenBuffer.h:8
+//===----------------------------------------------------------------------===//
+
+#ifndef LLVM_CLANG_TOOLING_SYNTAX_TOKEN_BUFFER_H
----------------
file comment?
sometimes they're covered by the class comments, but I think there's a bit to say here.
in particular the logical model (how we model the preprocessor, the two token streams and types of mappings between them) might go here.


================
Comment at: clang/include/clang/Tooling/Syntax/TokenBuffer.h:32
+public:
+  Token() = default;
+  Token(SourceLocation Location, unsigned Length, tok::TokenKind Kind)
----------------
what's the default state (and why have one)? do you need a way to query whether a token is "valid"?
(I'd avoid relying on just length() == 0 or location().isInvalid(), because it's not obvious to callers that this can happen)


================
Comment at: clang/include/clang/Tooling/Syntax/TokenBuffer.h:58
+
+static_assert(sizeof(Token) <= 16, "Token is unresonably large");
+
----------------
unresonably -> unreasonably


================
Comment at: clang/include/clang/Tooling/Syntax/TokenBuffer.h:60
+
+/// A top-level macro invocation inside a file, e.g.
+///   #define FOO 1+2
----------------
If you're going to say "invocation" in the name, say "use" in the comment (or vice versa!)


================
Comment at: clang/include/clang/Tooling/Syntax/TokenBuffer.h:74
+  /// The range covering macroTokens().
+  std::pair<unsigned, unsigned> macroRange(const TokenBuffer &B,
+                                           const SourceManager &SM) const;
----------------
It seems weirdly non-orthogonal to be able to get the range but not the file ID.

I'd suggest either adding another accessor for that, or returning a `struct BufferRange { FileID File; unsigned Begin, End; }` (with a better name.)

I favor the latter, because `.first` and `.second` don't offer the reader semantic hints at the callsite, and because that gives a nice place to document the half-open nature of the interval.

Actually, I'd suggest adding such a struct accessor to Token too.


================
Comment at: clang/include/clang/Tooling/Syntax/TokenBuffer.h:87
+  /// to the identifier token corresponding to a name of the expanded macro.
+  unsigned BeginMacroToken = 0;
+  /// End index in TokenBuffer::macroTokens().
----------------
please rename along with macroTokens() function


================
Comment at: clang/include/clang/Tooling/Syntax/TokenBuffer.h:92
+
+/// A list of tokens obtained by lexing and preprocessing a text buffer and a
+/// set of helpers to allow mapping the tokens after preprocessing to the
----------------
Warning, braindump follows - let's discuss further.
We talked a bunch offline about the logical and concrete data model here.

As things stand:
 - #includes are not expanded, but will refer to a file ID with its own buffer: `map<FileID, TokenBuffer>` is the whole structure
 - no information is captured about other PP interactions (PP directives that generate no tokens, skipped sections)
 - the spelled sequence of tokens is not directly available (but expanded + macro invocations may be enough to reconstruct it)

If we keep this model, I think we should spell both of these out in the comment.
But there's another fairly natural model we could consider:

 - we have a single TokenBuffer with a flat list of all expanded tokens in the TU, rather than one per file
 - one FileID corresponds to a contiguous range of *transitive* tokens; these ranges nest
 - the TokenBuffer also stores the original tokens as `map<FileID, vector<Token>>`
 - spelled -> expanded token mapping: spelled tokens for a file are partitioned into ranges (types: literal, include, macro invocation, other pp directive, pp skipped region). each range maps onto a range of expanded tokens (empty for e.g. a pp directive or skipped region)
 - expanded -> spelled token mapping is similar. (It's almost the inverse of the other mapping; we drop the "include" mapping ranges)
 - This can naturally handle comments, preprocessor directives, pp skipped sections, etc. - these are in the spelled stream but not the expanded stream.

e.g. for this code
```
// foo.cpp
int a;
#include "foo.h"
int b =  FOO(42);
// foo.h
#define FOO(X) X*X
int c;
```

we'd have this buffer:
```
expandedTokens = [int a ; int c ; int b = 42 * 42 ;]
spelledTokens = {
  "foo.cpp" => [int a; # include "foo.h" int b = FOO ( 42 ) ;],
  "foo.h" => [# define FOO ( X ) X * X int c ;]
}

expandedToSpelling = {
  int => int (foo.cpp), type = literal
  a => a
  ; => ;

  [] => [# define FOO ( X ) X * X](foo.h), type=preprocessor directive
  int => int (foo.h)
  c => c
  ; => ;

  int => int (foo.cpp)
  b => b
  = => =
  [42 * 42] => [FOO ( 42 ) ](foo.cpp), type=macro invocation
  ; => ; (foo.cpp)
}

spellingToExpanded = {
  // foo.cpp
  int => int, type=literal
  a => a
  ; => ;
  [# include "foo.h"] => [int c ;], type=include
  int => int
  b => b
  = => =
  [FOO ( X )] => [42 * 42], type=macro invocation
  ; => ;

  // foo.h
  [# define FOO ( X ) X * X] => [], type=preprocessor directive
  int => int
  c => c
  ; => ;
}
```

Various implementation strategies possible here, one obvious one is to use a flat sorted list, and include a sequence of literal tokens as a single entry. 

```
struct ExpandedSpellingMapping {
  unsigned ExpandedStart, ExpandedEnd;
  FileID SpellingFile; // redundant for illustration: can actually derive from SpellingTokens[SpellingStart].location()
  unsigned SpellingStart, SpellingEnd;
  enum { LiteralSequence, MacroInvocation, Preprocessor, PPSkipped, Inclusion } Kind;
}
vector<ExpandedSpellingMapping> ExpandedToSpelling; // bsearchable
vector<pair<FileID, ExpandedSpellingMapping>> Inclusions; // for spelling -> expanded mapping. redundant: much can be taken from SourceManager.
```

A delta-encoded representation with chunks for bsearch will likely be much more compact, if this proves large.
(Kind of similar to the compressed posting lists in clangd Dex)

But as-is, the mappings for the example file would be:
```
Expanded = {
  {0, 3, File0, 0, 3, LiteralSequence}, // int a;
  {3, 3, File1, 0, 9, Preprocessor}, // #define FOO(X) X * X
  {3, 6, File1, 9, 12, LiteralSequence}, // int c;
  {6, 9, File0, 6, 9, LiteralSequence}, // int b =
  {9, 12, File0, 9, 13, MacroInvocation}, // FOO(42)
  {12, 13, File0, 13, 14, LiteralSequence}, // ;
}
Inclusions = {
  {File1, {3, 6, File0, 3, 6, Inclusion}}, // #include 
}
```


================
Comment at: clang/include/clang/Tooling/Syntax/TokenBuffer.h:99
+///       replacements occurred.
+/// The tokens for (1) are stored directly and can be accessed with the tokens()
+/// method. However, some of these tokens may come from macro invocations and so
----------------
expandedTokens


================
Comment at: clang/include/clang/Tooling/Syntax/TokenBuffer.h:104
+///     #define FOO 10
+///     int a = FOO;  // no token '10' in the file, just 'FOO'
+///
----------------
(this example is confusing, there is a 10 in the file!)


================
Comment at: clang/include/clang/Tooling/Syntax/TokenBuffer.h:117
+  TokenBuffer() = default;
+  // Assumes no macro replacements have taken place.
+  TokenBuffer(std::vector<syntax::Token> Tokens);
----------------
`, preprocessor directives are parsed but not interpreted`

If using the model above, `this means that the spelled and expanded streams are identical`. (Or are we still going to strip comments?)



================
Comment at: clang/include/clang/Tooling/Syntax/TokenBuffer.h:146
+  llvm::Optional<std::pair<unsigned, unsigned>>
+  toOffsetRange(const Token *Begin, const Token *End,
+                const SourceManager &SM) const;
----------------
I'd think this would map a character range onto a character range, or a token range onto a token range - is there some reason otherwise?


================
Comment at: clang/include/clang/Tooling/Syntax/TokenBuffer.h:171
+  ///        #pragma, etc.
+  llvm::ArrayRef<syntax::Token> macroTokens() const { return MacroTokens; }
+
----------------
not sure quite what this is for, but "tokens not in the expanded stream" might include header-guarded #includes, comments, #ifdef'd sections, #defines...
is this API intended to be a whitelist (macro invocations specifically) or a blacklist (everything that's not literally in the expanded stream)?


Repository:
  rG LLVM Github Monorepo

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D59887/new/

https://reviews.llvm.org/D59887




