[PATCH] D59887: [Syntax] Introduce TokenBuffer, start clangToolingSyntax library

Ilya Biryukov via Phabricator via cfe-commits cfe-commits at lists.llvm.org
Thu Apr 11 07:30:23 PDT 2019


ilya-biryukov added a comment.

The new version addresses most of the comments; the few remaining ones from Dmitri concern the code doing the mapping, and I'll look into simplifying it and removing the possibly redundant checks.
Please take a look at this version, it is very close to the final state.



================
Comment at: clang/include/clang/Tooling/Syntax/TokenBuffer.h:72
+  /// macro or a name, arguments and parentheses of a function-like macro.
+  llvm::ArrayRef<syntax::Token> macroTokens(const TokenBuffer &B) const;
+  /// The range covering macroTokens().
----------------
ilya-biryukov wrote:
> sammccall wrote:
> > ilya-biryukov wrote:
> > > gribozavr wrote:
> > > > `invocationTokens` or `macroInvocationTokens`
> > > The plan is to eventually include the macro directives tokens, hence the name.
> > > `invocationTokens` are somewhat inaccurate in that case, can't come up with a better name.
> > I think your reply applies to TokenBuffer::macroTokens(), not MacroInvocation::macroTokens().
> > 
> > +1 to invocationTokens here.
> Yeah, sorry, I have mixed up these two functions with the same name. Will change to `invocationTokens`.
The Mapping is now internal and only stores the indices.
The two kinds of tokens are named "raw" and "expanded"; happy to consider alternatives for both.


================
Comment at: clang/include/clang/Tooling/Syntax/TokenBuffer.h:8
+//===----------------------------------------------------------------------===//
+
+#ifndef LLVM_CLANG_TOOLING_SYNTAX_TOKEN_BUFFER_H
----------------
sammccall wrote:
> file comment?
> sometimes they're covered by the class comments, but I think there's a bit to say here.
> in particular the logical model (how we model the preprocessor, the two token streams and types of mappings between them) might go here.
Added a file comment explaining the basic concepts inside the file.


================
Comment at: clang/include/clang/Tooling/Syntax/TokenBuffer.h:32
+public:
+  Token() = default;
+  Token(SourceLocation Location, unsigned Length, tok::TokenKind Kind)
----------------
sammccall wrote:
> what's the default state (and why have one?) do you need a way to query whether a token is "valid"?
> (I'd avoid just relying on length() == 0 or location().isInvalid() because it's not obvious to callers this can happen)
Got rid of the default state altogether; I haven't seen use-cases where it's important so far.
Using `Optional<Token>` for invalid tokens seems like a cleaner design anyway (we did not need it so far).
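
For illustration, a caller-side sketch of that design (`findTokenAt` is a hypothetical helper, not part of this patch):

```
// Absence of a token is explicit in the type; there is no ambiguous
// default-constructed state to misread as a real token.
llvm::Optional<syntax::Token> Tok = findTokenAt(Loc);
if (!Tok)
  return; // no token at this location
```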


================
Comment at: clang/include/clang/Tooling/Syntax/TokenBuffer.h:60
+
+/// A top-level macro invocation inside a file, e.g.
+///   #define FOO 1+2
----------------
sammccall wrote:
> If you're going to say "invocation" in the name, say "use" in the comment (or vice versa!)
The class is now internal to `TokenBuffer` and is called `Mapping`.


================
Comment at: clang/include/clang/Tooling/Syntax/TokenBuffer.h:66
+///                            macroTokens = {'BAR', '(', '1', ')'}.
+class MacroInvocation {
+public:
----------------
sammccall wrote:
> I'd suggest a more natural order is token -> tokenbuffer -> macroinvocation. Forward declare?
> 
> This is because the token buffer exposes token streams, which are probably the second-most important concept in the file (after tokens)
Done. I forgot that you can have members of type `vector<T>` where `T` is incomplete.
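
To illustrate the suggested order, a sketch of the header layout (class bodies elided):

```
#include <vector>

class Token { /* ... */ };

class MacroInvocation; // forward declaration

class TokenBuffer {
  std::vector<Token> ExpandedTokens;
  // A member of type vector<T> is allowed even while T is incomplete,
  // as long as MacroInvocation is defined before the vector is used.
  std::vector<MacroInvocation> Invocations;
};

class MacroInvocation { /* ... */ };
```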


================
Comment at: clang/include/clang/Tooling/Syntax/TokenBuffer.h:74
+  /// The range covering macroTokens().
+  std::pair<unsigned, unsigned> macroRange(const TokenBuffer &B,
+                                           const SourceManager &SM) const;
----------------
sammccall wrote:
> It seems weirdly non-orthogonal to be able to get the range but not the file ID.
> 
> I'd suggest either adding another accessor for that, or returning a `struct BufferRange { FileID File; unsigned Begin, End; }` (with a better name.)
> 
> I favor the latter, because `.first` and `.second` don't offer the reader semantic hints at the callsite, and because that gives a nice place to document the half-open nature of the interval.
> 
> Actually, I'd suggest adding such a struct accessor to Token too.
Done. I've used the name `FileRange` instead (the idea is that it's pointing to a substring in a file).
Let me know what you think about the name.
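
For reference, a rough shape of the struct as discussed (a sketch; the field names are assumptions, not the final API):

```
/// A half-open range of characters [Begin, End) inside a single file.
struct FileRange {
  FileID File;
  unsigned Begin = 0;
  unsigned End = 0;
};
```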



================
Comment at: clang/include/clang/Tooling/Syntax/TokenBuffer.h:87
+  /// to the identifier token corresponding to a name of the expanded macro.
+  unsigned BeginMacroToken = 0;
+  /// End index in TokenBuffer::macroTokens().
----------------
sammccall wrote:
> please rename along with macroTokens() function
This is now `BeginRawToken` and `EndRawToken`.
As usual, open to suggestions wrt naming.
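
For context, a rough shape of the internal mapping (field names assumed from the discussion above):

```
// Internal to TokenBuffer; stores only indices into the two token streams.
struct Mapping {
  // Half-open range into the raw (spelled) tokens of a file.
  unsigned BeginRawToken = 0;
  unsigned EndRawToken = 0;
  // Half-open range into the expanded token stream.
  unsigned BeginExpandedToken = 0;
  unsigned EndExpandedToken = 0;
};
```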


================
Comment at: clang/include/clang/Tooling/Syntax/TokenBuffer.h:92
+
+/// A list of tokens obtained by lexing and preprocessing a text buffer and a
+/// set of helpers to allow mapping the tokens after preprocessing to the
----------------
sammccall wrote:
> ilya-biryukov wrote:
> > sammccall wrote:
> > > Warning, braindump follows - let's discuss further.
> > > We talked a bunch offline about the logical and concrete data model here.
> > > 
> > > As things stand:
> > >  - #includes are not expanded, but will refer to a file ID with its own buffer: `map<FileID, TokenBuffer>` is the whole structure
> > >  - no information is captured about other PP interactions (PP directives that generate no tokens, skipped sections
> > >  - the spelled sequence of tokens is not directly available (but expanded + macro invocations may be enough to reconstruct it)
> > > 
> > > If we keep this model, I think we should spell both of these out in the comment.
> > > But there's another fairly natural model we could consider:
> > > 
> > >  - we have a single TokenBuffer with a flat list of all expanded tokens in the TU, rather than for the file 
> > >  - one FileID corresponds to a contiguous range of *transitive* tokens; these ranges nest
> > >  - the TokenBuffer also stores the original tokens as `map<FileID, vector<Token>>`
> > >  - spelled -> expanded token mapping: spelled tokens for a file are partitioned into ranges (types: literal, include, macro invocation, other pp directive, pp skipped region). each range maps onto a range of expanded tokens (empty for e.g. pp directive or skipped region)
> > >  - expanded -> spelled token is similar. (It's almost the inverse of the of the other mapping, we drop the "include" mapping ranges)
> > >  - This can naturally handle comments, preprocessor directives, pp skipped sections, etc - these are in the spelled stream but not the expanded stream.
> > > 
> > > e.g. for this code
> > > ```
> > > // foo.cpp
> > > int a;
> > > #include "foo.h"
> > > int b =  FOO(42);
> > > // foo.h
> > > #define FOO(X) X*X
> > > int c;
> > > ```
> > > 
> > > we'd have this buffer:
> > > ```
> > > expandedTokens = [int a ; int c ; int b = 42 * 42 ;]
> > > spelledTokens = {
> > >   "foo.cpp" => [int a; # include "foo.h" int b = FOO ( 42 ) ;],
> > >   "foo.h" => [# define FOO ( X ) X * X int c ;]
> > > }
> > > 
> > > expandedToSpelling = {
> > >   int => int (foo.cpp), type = literal
> > >   a => a
> > >   ; => ;
> > > 
> > >   [] => [# define FOO ( X ) X * X](foo.h), type=preprocessor directive
> > >   int => int (foo.h)
> > >   c => c
> > >   ; => ;
> > > 
> > >   int => int (foo.cpp)
> > >   b => b
> > >   = => =
> > >   [42 * 42] => [FOO ( 42 ) ](foo.cpp), type=macro invocation
> > >   ; => ; (foo.cpp)
> > > }
> > > 
> > > spellingToExpanded = {
> > >   // foo.cpp
> > >   int => int, type=literal
> > >   a => a
> > >   ; => ;
> > >   [# include "foo.h"] => [int c ;], type=include
> > >   int => int
> > >   b => b
> > >   = => =
> > >   [FOO ( X )] => [42 * 42], type=macro invocation
> > >   ; => ;
> > > 
> > >   // foo.h
> > >   [# define FOO ( X ) X * X] => [], type=preprocessor directive
> > >   int => int
> > >   c => c
> > >   ; => ;
> > > }
> > > ```
> > > 
> > > Various implementation strategies possible here, one obvious one is to use a flat sorted list, and include a sequence of literal tokens as a single entry. 
> > > 
> > > ```
> > > struct ExpandedSpellingMapping {
> > >   unsigned ExpandedStart, ExpandedEnd;
> > >   FileID SpellingFile; // redundant for illustration: can actually derive from SpellingTokens[SpellingStart].location()
> > >   unsigned SpellingStart, SpellingEnd;
> > >   enum { LiteralSequence, MacroInvocation, Preprocessor, PPSkipped, Inclusion } Kind;
> > > }
> > > vector<ExpandedSpellingMapping> ExpandedToSpelling; // bsearchable
> > > vector<pair<FileID, ExpandedSpellingMapping>> Inclusions; // for spelling -> expanded mapping. redundant: much can be taken from SourceManager.
> > > ```
> > > 
> > > A delta-encoded representation with chunks for bsearch will likely be much more compact, if this proves large.
> > > (Kind of similar to the compressed posting lists in clangd Dex)
> > > 
> > > But as-is, the mappings for the example file would be:
> > > ```
> > > Expanded = {
> > >   {0, 3, File0, 0, 3, LiteralSequence}, // int a;
> > >   {3, 3, File1, 0, 9, Preprocessor}, // #define FOO(X) X * X
> > >   {3, 6, File1, 9, 12, LiteralSequence}, // int c;
> > >   {6, 9, File0, 6, 9, LiteralSequence}, // int b =
> > >   {9, 12, File0, 9, 13, MacroExpansion}, // FOO(42)
> > >   {12, 13, File0, 13, 14, LiteralSequence}, // ;
> > > }
> > > Inclusions = {
> > >   {File1, {3, 6, File0, 3, 6, Inclusion}}, // #include 
> > > }
> > > ```
> > Thanks for raising the multi-file issue earlier rather than later. My original intention was to model `TokenBuffer` as a stream of expanded tokens **for a single FileID**, i.e. we would have multiple `TokenBuffers` for multiple files.
> > 
> > Your suggestion of having a single set of expanded tokens and a map from FileID to vectors of raw tokens (i.e. before preprocessing) looks very nice. A single expanded token stream is definitely a better model for the syntax trees (each node of a tree would cover a continuous subrange of the expanded token stream).
> > 
> > In short, the highlights of the proposed model that I find most notable are:
> > - store a **single** stream of expanded tokens for a TU.
> > - store raw tokens for each file (mapped either by FileID or FileEntry).
> > - store information about preprocessor directives and macro replacements in each FileID (i.e. `MacroInvocation` in this patch)
> > - support an operation of mapping a subrange of expanded tokens into the subrange of raw-tokens in a particular FileID. This operation might fail.
> > 
> > Let me know if I missed something.
> Yes, I think we're on the same page.
> 
> Some more details:
> > store a single stream of expanded tokens for a TU.
> BTW I don't think we're giving up the ability to tokenize only a subset of files. If we filter out files, we can just omit their spelled token stream and omit their tokens from the expanded stream. There's complexity if we want to reject a file and accept its includes, but I think the most important case is "tokenize the main file only" where that's not an issue.
> 
> > store information about preprocessor directives and macro replacements in each FileID
> I do think it's useful to treat these as similar things - both for the implementation of mapping between streams, and for users of the API. My hope is that most callsites won't have to consider issues of macros, PP directives, ifdef'd code, etc as separate issues.
> 
> > support an operation of mapping a subrange of expanded tokens into the subrange of raw-tokens in a particular FileID. This operation might fail
> Yes, though I don't have a great intuition for what the API should look like
>  - if you select part of an expansion, should the result include the macro invocation, exclude, or fail?
>  - if the expansion range ends at a zero-size expansion (like a PP directive), should it be included or excluded?
>  - if the result spans files, is that a failure or do we represent it somehow?
> Some of these probably need to be configurable
> (We also have the mapping in the other direction, which shouldn't fail)
> Yes, though I don't have a great intuition for what the API should look like

Agree, and I think there's room for all of these options. For the simplicity of the initial implementation, I would pick the set of trade-offs that works for the first use-case (moving around macro calls that span full syntax subtrees) and document it.

When the use-cases pop up, we could add more configuration, refactor the APIs, etc.



================
Comment at: clang/include/clang/Tooling/Syntax/TokenBuffer.h:117
+  TokenBuffer() = default;
+  // Assumes no macro replacements have taken place.
+  TokenBuffer(std::vector<syntax::Token> Tokens);
----------------
sammccall wrote:
> `, preprocessor directives are parsed but not interpreted`
> 
> If using the model above, `this means that the spelled and expanded streams are identical`. (Or are we still going to strip comments?)
> 
Removed this constructor altogether; it does not make much sense in the new model.
Instead, `tokenize()` now returns the raw tokens directly as a `vector<Token>`.
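
A usage sketch (the exact parameter list is an assumption based on this discussion):

```
// Lex a single file without running the preprocessor; since no macro
// replacements take place, the raw and expanded streams are identical.
std::vector<syntax::Token> RawTokens =
    syntax::tokenize(FID, SourceMgr, LangOpts);
```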


================
Comment at: clang/include/clang/Tooling/Syntax/TokenBuffer.h:146
+  llvm::Optional<std::pair<unsigned, unsigned>>
+  toOffsetRange(const Token *Begin, const Token *End,
+                const SourceManager &SM) const;
----------------
ilya-biryukov wrote:
> sammccall wrote:
> > I'd think this would map a character range onto a character range, or a token range onto a token range - is there some reason otherwise?
> No particular reason, this was driven by its usage in function computing replacements (not in this patch, see [[ https://github.com/ilya-biryukov/llvm-project/blob/syntax/clang/lib/Tooling/Syntax/ComputeReplacements.cpp | github prototype ]], if interested, but keep in mind the prototype is not ready for review yet).
> 
> Mapping to a range of "raw" tokens seems more consistent with our model, so I will update accordingly; obtaining a character range is easy once one has the tokens from the file.
Added a method to map from an expanded token range to a raw token range.
Kept the method that maps an expanded token range to a character range too.

Note that we don't currently have a way to map from a raw token range to an expanded one. That could be added later; we don't need it for the initial use-case (mapping the ranges coming from AST nodes into raw source ranges), so I left it out for now.
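
Roughly, the two operations look like this (a sketch; the name and return type of the raw-token variant are assumptions):

```
// Maps a range of expanded tokens back to the raw tokens of a single file.
// Returns None if the range cannot be mapped cleanly, e.g. if it covers
// only a part of a macro expansion.
llvm::Optional<llvm::ArrayRef<syntax::Token>>
toRawTokenRange(const syntax::Token *Begin, const syntax::Token *End,
                const SourceManager &SM) const;

// The same mapping, but produces a pair of file offsets instead.
llvm::Optional<std::pair<unsigned, unsigned>>
toOffsetRange(const syntax::Token *Begin, const syntax::Token *End,
              const SourceManager &SM) const;
```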


================
Comment at: clang/lib/Tooling/Syntax/TokenBuffer.cpp:61
+
+class TokenCollector::Callbacks : public PPCallbacks {
+public:
----------------
sammccall wrote:
> I'm afraid this really needs a class comment describing what this is supposed to do (easy!) and the implementation strategy (hard!)
> 
> In particular, it looks like macros like this:
>  - we expect to see the tokens making up the macro invocation first (... FOO ( 42 ))
>  - then we see MacroExpands which allows us to recognize the last N tokens are a macro invocation. We create a MacroInvocation object, and remember where the invocation ends
>  - then we see the tokens making up the macro expansion
>  - finally, once we see the next token after the invocation, we finalize the MacroInvocation.
Added a comment. The model is now a bit simpler (more work is done by the preprocessor), but let me know if the comment is still unclear and could be improved.
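
For readers unfamiliar with the hook involved, a minimal sketch of observing expansions via `PPCallbacks` (illustrative only, not the actual `TokenCollector` code):

```
#include "clang/Lex/PPCallbacks.h"
#include "clang/Lex/Token.h"

class CollectExpansions : public clang::PPCallbacks {
  void MacroExpands(const clang::Token &MacroName,
                    const clang::MacroDefinition &MD,
                    clang::SourceRange Range,
                    const clang::MacroArgs *Args) override {
    // Range covers the spelled tokens of the invocation, e.g. `FOO(42)`;
    // the tokens it expands to are lexed by the preprocessor afterwards.
  }
};
```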


================
Comment at: clang/lib/Tooling/Syntax/TokenBuffer.cpp:285
+    std::tie(File, Offset) = SM.getDecomposedLoc(L);
+    if (File != MacroInvocationFile || Offset <= ExpansionEndOffset)
+      return;
----------------
sammccall wrote:
> is there a problem if we never see another token in that file after the expansion?
> (or do we see the eof token in this case?)
Yeah, there should always be an eof.
Added a comment and a test for this.
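
A sketch of what the test can assert (the `kind()` accessor is assumed from the constructor above):

```
// Lexing always produces a trailing eof token, so the code that finalizes
// an invocation is guaranteed to observe a token after the expansion.
assert(!RawTokens.empty() && RawTokens.back().kind() == tok::eof);
```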


Repository:
  rG LLVM Github Monorepo

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D59887/new/

https://reviews.llvm.org/D59887




