[PATCH] D59887: [Syntax] Introduce TokenBuffer, start clangToolingSyntax library

Ilya Biryukov via Phabricator via cfe-commits cfe-commits at lists.llvm.org
Tue Apr 2 01:45:56 PDT 2019


ilya-biryukov added a comment.

Will address the rest of the comments later, answering a few questions that popped up first.



================
Comment at: clang/include/clang/Tooling/Syntax/TokenBuffer.h:72
+  /// macro or a name, arguments and parentheses of a function-like macro.
+  llvm::ArrayRef<syntax::Token> macroTokens(const TokenBuffer &B) const;
+  /// The range covering macroTokens().
----------------
sammccall wrote:
> ilya-biryukov wrote:
> > gribozavr wrote:
> > > `invocationTokens` or `macroInvocationTokens`
> > The plan is to eventually include the macro directives tokens, hence the name.
> > `invocationTokens` are somewhat inaccurate in that case, can't come up with a better name.
> I think your reply applies to TokenBuffer::macroTokens(), not MacroInvocation::macroTokens().
> 
> +1 to invocationTokens here.
Yeah, sorry, I have mixed up these two functions with the same name. Will change to `invocationTokens`.
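
For concreteness, the declaration quoted above would then read (a sketch of the rename only, not the final patch):

```
// MacroInvocation, after the rename (sketch):
llvm::ArrayRef<syntax::Token> invocationTokens(const TokenBuffer &B) const;
```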


================
Comment at: clang/include/clang/Tooling/Syntax/TokenBuffer.h:1
+//===- TokenBuffer.h - store tokens of preprocessed files -----*- C++ -*-=====//
+//
----------------
sammccall wrote:
> are you sure TokenBuffer is the central concept in this file, rather than just the thing with the most code? Token.h might end up being a better name for users.
I don't mind changing this to `Token.h`, although I'd personally expect that a file with this name only contains a definition for the token class.
`Tokens.h` would be a better fit from my POV. WDYT?


================
Comment at: clang/include/clang/Tooling/Syntax/TokenBuffer.h:92
+
+/// A list of tokens obtained by lexing and preprocessing a text buffer and a
+/// set of helpers to allow mapping the tokens after preprocessing to the
----------------
sammccall wrote:
> Warning, braindump follows - let's discuss further.
> We talked a bunch offline about the logical and concrete data model here.
> 
> As things stand:
>  - #includes are not expanded, but will refer to a file ID with its own buffer: `map<FileID, TokenBuffer>` is the whole structure
>  - no information is captured about other PP interactions (PP directives that generate no tokens, skipped sections)
>  - the spelled sequence of tokens is not directly available (but expanded + macro invocations may be enough to reconstruct it)
> 
> If we keep this model, I think we should spell both of these out in the comment.
> But there's another fairly natural model we could consider:
> 
>  - we have a single TokenBuffer with a flat list of all expanded tokens in the TU, rather than for the file 
>  - one FileID corresponds to a contiguous range of *transitive* symbols, these ranges nest
>  - the TokenBuffer also stores the original tokens as `map<FileID, vector<Token>>`
>  - spelled -> expanded token mapping: spelled tokens for a file are partitioned into ranges (types: literal, include, macro invocation, other pp directive, pp skipped region). each range maps onto a range of expanded tokens (empty for e.g. pp directive or skipped region)
>  - expanded -> spelled token is similar. (It's almost the inverse of the other mapping; we drop the "include" mapping ranges)
>  - This can naturally handle comments, preprocessor directives, pp skipped sections, etc - these are in the spelled stream but not the expanded stream.
> 
> e.g. for this code
> ```
> // foo.cpp
> int a;
> #include "foo.h"
> int b =  FOO(42);
> // foo.h
> #define FOO(X) X*X
> int c;
> ```
> 
> we'd have this buffer:
> ```
> expandedTokens = [int a ; int c ; int b = 42 * 42 ;]
> spelledTokens = {
>   "foo.cpp" => [int a; # include "foo.h" int b = FOO ( 42 ) ;],
>   "foo.h" => [# define FOO ( X ) X * X int c ;]
> }
> 
> expandedToSpelling = {
>   int => int (foo.cpp), type = literal
>   a => a
>   ; => ;
> 
>   [] => [# define FOO ( X ) X * X](foo.h), type=preprocessor directive
>   int => int (foo.h)
>   c => c
>   ; => ;
> 
>   int => int (foo.cpp)
>   b => b
>   = => =
>   [42 * 42] => [FOO ( 42 ) ](foo.cpp), type=macro invocation
>   ; => ; (foo.cpp)
> }
> 
> spellingToExpanded = {
>   // foo.cpp
>   int => int, type=literal
>   a => a
>   ; => ;
>   [# include "foo.h"] => [int c ;], type=include
>   int => int
>   b => b
>   = => =
>   [FOO ( X )] => [42 * 42], type=macro invocation
>   ; => ;
> 
>   // foo.h
>   [# define FOO ( X ) X * X] => [], type=preprocessor directive
>   int => int
>   c => c
>   ; => ;
> }
> ```
> 
> Various implementation strategies are possible here; one obvious one is to use a flat sorted list and include a sequence of literal tokens as a single entry.
> 
> ```
> struct ExpandedSpellingMapping {
>   unsigned ExpandedStart, ExpandedEnd;
>   FileID SpellingFile; // redundant for illustration: can actually derive from SpellingTokens[SpellingStart].location()
>   unsigned SpellingStart, SpellingEnd;
>   enum { LiteralSequence, MacroInvocation, Preprocessor, PPSkipped, Inclusion } Kind;
> }
> vector<ExpandedSpellingMapping> ExpandedToSpelling; // bsearchable
> vector<pair<FileID, ExpandedSpellingMapping>> Inclusions; // for spelling -> expanded mapping. redundant: much can be taken from SourceManager.
> ```
> 
> A delta-encoded representation with chunks for bsearch will likely be much more compact, if this proves large.
> (Kind of similar to the compressed posting lists in clangd Dex)
> 
> But as-is, the mappings for the example file would be:
> ```
> Expanded = {
>   {0, 3, File0, 0, 3, LiteralSequence}, // int a;
>   {3, 3, File1, 0, 9, Preprocessor}, // #define FOO(X) X * X
>   {3, 6, File1, 9, 12, LiteralSequence}, // int c;
>   {6, 9, File0, 6, 9, LiteralSequence}, // int b =
>   {9, 12, File0, 9, 13, MacroInvocation}, // FOO(42)
>   {12, 13, File0, 13, 14, LiteralSequence}, // ;
> }
> Inclusions = {
>   {File1, {3, 6, File0, 3, 6, Inclusion}}, // #include 
> }
> ```
Thanks for raising the multi-file issue earlier rather than later. My original intention was to model `TokenBuffer` as a stream of expanded tokens **for a single FileID**, i.e. we would have multiple `TokenBuffers` for multiple files.

Your suggestion of having a single set of expanded tokens and a map from FileID to vectors of raw tokens (i.e. before preprocessing) looks very nice. A single expanded token stream is definitely a better model for the syntax trees (each node of a tree would cover a continuous subrange of the expanded token stream).

In short, the highlights of the proposed model that I find most notable are:
- store a **single** stream of expanded tokens for a TU.
- store raw tokens for each file (mapped either by FileID or FileEntry).
- store information about preprocessor directives and macro replacements in each FileID (i.e. `MacroInvocation` in this patch)
- support mapping a subrange of expanded tokens to the corresponding subrange of raw tokens in a particular FileID. This operation might fail.

Let me know if I missed something.
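
To make sure we are on the same page, here is a rough sketch of how I read the proposed model; the names, signatures and the container choices are placeholders of my own, not what I intend to land verbatim:

```
class TokenBuffer {
public:
  /// All tokens produced by the preprocessor, for the whole TU.
  llvm::ArrayRef<syntax::Token> expandedTokens() const;
  /// Tokens of a single file before preprocessing ("raw" tokens).
  llvm::ArrayRef<syntax::Token> rawTokens(FileID FID) const;
  /// Map a subrange of expanded tokens back to the raw tokens spelled inside
  /// a single FileID. Returns llvm::None when that is impossible, e.g. when
  /// the range covers only a part of a macro expansion.
  llvm::Optional<llvm::ArrayRef<syntax::Token>>
  rawTokensForExpanded(llvm::ArrayRef<syntax::Token> Expanded) const;

private:
  /// A single stream of expanded tokens for the whole TU.
  std::vector<syntax::Token> ExpandedTokens;
  /// Raw tokens of each file, keyed by FileID.
  llvm::DenseMap<FileID, std::vector<syntax::Token>> RawTokens;
  /// Per-file PP information: directives and macro replacements
  /// (`MacroInvocation` in the current version of the patch).
  llvm::DenseMap<FileID, std::vector<MacroInvocation>> PPStructure;
};
```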


================
Comment at: clang/include/clang/Tooling/Syntax/TokenBuffer.h:146
+  llvm::Optional<std::pair<unsigned, unsigned>>
+  toOffsetRange(const Token *Begin, const Token *End,
+                const SourceManager &SM) const;
----------------
sammccall wrote:
> I'd think this would map a character range onto a character range, or a token range onto a token range - is there some reason otherwise?
No particular reason; this was driven by its usage in a function computing replacements (not in this patch; see the [[ https://github.com/ilya-biryukov/llvm-project/blob/syntax/clang/lib/Tooling/Syntax/ComputeReplacements.cpp | github prototype ]] if interested, but keep in mind the prototype is not ready for review yet).

Mapping to a range of "raw" tokens seems more consistent with our model, so I will update accordingly; obtaining an offset range is easy once one has the tokens from the file.
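
For example, computing the offsets from the raw tokens could look roughly like this (a hypothetical helper of mine, assuming `Token` exposes `location()` and `length()`):

```
// Hypothetical helper, not part of this patch: turn a non-empty range of raw
// tokens from a single file into a pair of file offsets.
std::pair<unsigned, unsigned> toOffsets(llvm::ArrayRef<syntax::Token> Raw,
                                        const SourceManager &SM) {
  assert(!Raw.empty());
  unsigned Begin = SM.getFileOffset(Raw.front().location());
  unsigned End = SM.getFileOffset(Raw.back().location()) + Raw.back().length();
  return {Begin, End};
}
```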


================
Comment at: clang/include/clang/Tooling/Syntax/TokenBuffer.h:171
+  ///        #pragma, etc.
+  llvm::ArrayRef<syntax::Token> macroTokens() const { return MacroTokens; }
+
----------------
sammccall wrote:
> not sure quite what this is for, but "tokens not in expanded stream" might include header-guarded #includes, comments, ifdef-'d sections, #defines...
> is this API intended to be a whitelist (macro invocations specifically) or a blacklist (everything that's not literally in the expanded stream)?
This was supposed to be all **raw** tokens in a file that are not part of the expanded token stream, i.e. all PP directives, macro-replaced tokens (object-like and function-like macros), tokens in skipped conditional branches, etc.

In your proposed model we would store the raw tokens of a file separately, so this would not be necessary.
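
To illustrate with my own example (not code from the patch), these are the kinds of tokens I have in mind:

```
#define SQUARE(X) X * X   // raw-only: PP directive
#ifdef UNSET_FLAG         // raw-only: PP directive
int disabled;             // raw-only: inside a skipped branch
#else
int enabled = SQUARE(2);  // "SQUARE ( 2 )" is raw-only; the expanded stream
                          // has "int enabled = 2 * 2 ;" instead
#endif
```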


Repository:
  rG LLVM Github Monorepo

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D59887/new/

https://reviews.llvm.org/D59887




