[PATCH] D59887: [Syntax] Introduce TokenBuffer, start clangToolingSyntax library

Sam McCall via Phabricator via cfe-commits cfe-commits at lists.llvm.org
Tue Apr 2 06:08:33 PDT 2019


sammccall added inline comments.


================
Comment at: clang/include/clang/Tooling/Syntax/TokenBuffer.h:1
+//===- TokenBuffer.h - store tokens of preprocessed files -----*- C++ -*-=====//
+//
----------------
ilya-biryukov wrote:
> sammccall wrote:
> > are you sure TokenBuffer is the central concept in this file, rather than just the thing with the most code? Token.h might end up being a better name for users.
> I don't mind changing this to `Token.h`, although I'd personally expect that a file with this name only contains a definition for the token class.
> `Tokens.h` would be a better fit from my POV. WDYT?
Sure, Tokens.h SGTM.


================
Comment at: clang/include/clang/Tooling/Syntax/TokenBuffer.h:66
+///                            macroTokens = {'BAR', '(', '1', ')'}.
+class MacroInvocation {
+public:
----------------
I'd suggest a more natural order: Token -> TokenBuffer -> MacroInvocation. Forward-declare MacroInvocation?

This is because the token buffer exposes token streams, which are probably the second-most important concept in the file (after tokens).
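To illustrate, a rough sketch of that declaration order (hypothetical code, not from the patch):

```
namespace syntax {
class Token { /* the central concept, defined first */ };
class MacroInvocation; // a forward declaration is enough for TokenBuffer
class TokenBuffer {
  // Exposes llvm::ArrayRef<Token> streams; mentions MacroInvocation only in
  // return types and parameters, so its full definition can come last.
};
class MacroInvocation { /* defined after the buffer */ };
} // namespace syntax
```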


================
Comment at: clang/include/clang/Tooling/Syntax/TokenBuffer.h:92
+
+/// A list of tokens obtained by lexing and preprocessing a text buffer and a
+/// set of helpers to allow mapping the tokens after preprocessing to the
----------------
ilya-biryukov wrote:
> sammccall wrote:
> > Warning, braindump follows - let's discuss further.
> > We talked a bunch offline about the logical and concrete data model here.
> > 
> > As things stand:
> >  - #includes are not expanded, but will refer to a file ID with its own buffer: `map<FileID, TokenBuffer>` is the whole structure
> >  - no information is captured about other PP interactions (PP directives that generate no tokens, skipped sections)
> >  - the spelled sequence of tokens is not directly available (but expanded + macro invocations may be enough to reconstruct it)
> > 
> > If we keep this model, I think we should spell both of these out in the comment.
> > But there's another fairly natural model we could consider:
> > 
> >  - we have a single TokenBuffer with a flat list of all expanded tokens in the TU, rather than one per file
> >  - one FileID corresponds to a contiguous range of *transitive* tokens; these ranges nest
> >  - the TokenBuffer also stores the original tokens as `map<FileID, vector<Token>>`
> >  - spelled -> expanded token mapping: spelled tokens for a file are partitioned into ranges (types: literal, include, macro invocation, other pp directive, pp skipped region). Each range maps onto a range of expanded tokens (empty for e.g. pp directive or skipped region)
> >  - expanded -> spelled token mapping is similar. (It's almost the inverse of the other mapping; we drop the "include" mapping ranges)
> >  - This can naturally handle comments, preprocessor directives, pp skipped sections, etc - these are in the spelled stream but not the expanded stream.
> > 
> > e.g. for this code
> > ```
> > // foo.cpp
> > int a;
> > #include "foo.h"
> > int b =  FOO(42);
> > // foo.h
> > #define FOO(X) X*X
> > int c;
> > ```
> > 
> > we'd have this buffer:
> > ```
> > expandedTokens = [int a ; int c ; int b = 42 * 42 ;]
> > spelledTokens = {
> >   "foo.cpp" => [int a; # include "foo.h" int b = FOO ( 42 ) ;],
> >   "foo.h" => [# define FOO ( X ) X * X int c ;]
> > }
> > 
> > expandedToSpelling = {
> >   int => int (foo.cpp), type = literal
> >   a => a
> >   ; => ;
> > 
> >   [] => [# define FOO ( X ) X * X](foo.h), type=preprocessor directive
> >   int => int (foo.h)
> >   c => c
> >   ; => ;
> > 
> >   int => int (foo.cpp)
> >   b => b
> >   = => =
> >   [42 * 42] => [FOO ( 42 ) ](foo.cpp), type=macro invocation
> >   ; => ; (foo.cpp)
> > }
> > 
> > spellingToExpanded = {
> >   // foo.cpp
> >   int => int, type=literal
> >   a => a
> >   ; => ;
> >   [# include "foo.h"] => [int c ;], type=include
> >   int => int
> >   b => b
> >   = => =
> >   [FOO ( 42 )] => [42 * 42], type=macro invocation
> >   ; => ;
> > 
> >   // foo.h
> >   [# define FOO ( X ) X * X] => [], type=preprocessor directive
> >   int => int
> >   c => c
> >   ; => ;
> > }
> > ```
> > 
> > Various implementation strategies are possible here; one obvious one is to use a flat sorted list, including a sequence of literal tokens as a single entry.
> > 
> > ```
> > struct ExpandedSpellingMapping {
> >   unsigned ExpandedStart, ExpandedEnd;
> >   FileID SpellingFile; // redundant for illustration: can actually derive from SpellingTokens[SpellingStart].location()
> >   unsigned SpellingStart, SpellingEnd;
> >   enum { LiteralSequence, MacroInvocation, Preprocessor, PPSkipped, Inclusion } Kind;
> > };
> > vector<ExpandedSpellingMapping> ExpandedToSpelling; // bsearchable
> > vector<pair<FileID, ExpandedSpellingMapping>> Inclusions; // for spelling -> expanded mapping. redundant: much can be taken from SourceManager.
> > ```
> > 
> > A delta-encoded representation with chunks for bsearch will likely be much more compact, if this proves large.
> > (Kind of similar to the compressed posting lists in clangd Dex)
> > 
> > But as-is, the mappings for the example file would be:
> > ```
> > Expanded = {
> >   {0, 3, File0, 0, 3, LiteralSequence}, // int a;
> >   {3, 3, File1, 0, 9, Preprocessor}, // #define FOO(X) X * X
> >   {3, 6, File1, 9, 12, LiteralSequence}, // int c;
> >   {6, 9, File0, 6, 9, LiteralSequence}, // int b =
> >   {9, 12, File0, 9, 13, MacroInvocation}, // FOO(42)
> >   {12, 13, File0, 13, 14, LiteralSequence}, // ;
> > }
> > Inclusions = {
> >   {File1, {3, 6, File0, 3, 6, Inclusion}}, // #include 
> > }
> > ```
> Thanks for raising the multi-file issue earlier rather than later. My original intention was to model `TokenBuffer` as a stream of expanded tokens **for a single FileID**, i.e. we would have multiple `TokenBuffers` for multiple files.
> 
> Your suggestion of having a single set of expanded tokens and a map from FileID to vectors of raw tokens (i.e. before preprocessing) looks very nice. A single expanded token stream is definitely a better model for the syntax trees (each node of a tree would cover a continuous subrange of the expanded token stream).
> 
> In short, the highlights of the proposed model that I find most notable are:
> - store a **single** stream of expanded tokens for a TU.
> - store raw tokens for each file (mapped either by FileID or FileEntry).
> - store information about preprocessor directives and macro replacements in each FileID (i.e. `MacroInvocation` in this patch)
> - support an operation mapping a subrange of expanded tokens to the corresponding subrange of raw tokens in a particular FileID. This operation might fail.
> 
> Let me know if I missed something.
Yes, I think we're on the same page.

Some more details:
> store a single stream of expanded tokens for a TU.
BTW, I don't think we're giving up the ability to tokenize only a subset of files. If we filter out files, we can just omit their spelled token streams and their tokens from the expanded stream. There's complexity if we want to reject a file but accept its includes; the most important case, though, is "tokenize the main file only", where that's not an issue.
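For illustration, filtering could be a collection-time knob; a purely hypothetical sketch (nothing like this exists in the patch):

```
struct CollectionOptions {
  // Record a spelled token stream (and the corresponding expanded tokens)
  // only for files matching this predicate, e.g. just the main file.
  std::function<bool(FileID)> CollectFile;
};
```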

> store information about preprocessor directives and macro replacements in each FileID
I do think it's useful to treat these as similar things - both for the implementation of the mapping between streams and for users of the API. My hope is that most call sites won't have to treat macros, PP directives, ifdef'd code, etc. as separate issues.
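Concretely, that could mean a single mapping record along the lines of the `ExpandedSpellingMapping` quoted above, where most callers never inspect the kind; a hypothetical sketch:

```
struct Mapping {
  enum Kind { MacroInvocation, Directive, SkippedRegion, Inclusion };
  Kind K; // callers that only need spelled <-> expanded ranges can ignore this
  // Half-open index ranges into the spelled and expanded token streams.
  unsigned SpelledBegin, SpelledEnd;
  unsigned ExpandedBegin, ExpandedEnd; // empty for directives/skipped code
};
```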

> support an operation of mapping a subrange of expanded tokens into the subrange of raw-tokens in a particular FileID. This operation might fail
Yes, though I don't have a great intuition for what the API should look like:
 - if you select part of an expansion, should the result include the whole macro invocation, exclude it, or fail?
 - if the expansion range ends at a zero-size expansion (like a PP directive), should it be included or excluded?
 - if the result spans files, is that a failure, or do we represent it somehow?
Some of these probably need to be configurable; a strawman sketch follows.
(We also have the mapping in the other direction, which shouldn't fail.)
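To make the configuration point concrete, a strawman of the failing direction, assuming `llvm::Optional` and the patch's `syntax::Token` (the policy names are made up):

```
enum class PartialSelectionPolicy {
  ExtendToInvocation, // grow the range to cover the whole macro invocation
  Fail,               // return llvm::None on partial selections
};

// Maps a subrange of the expanded stream back to spelled tokens in a single
// FileID; returns llvm::None if the range cannot be mapped, e.g. when the
// result would span files.
llvm::Optional<llvm::ArrayRef<syntax::Token>>
spelledForExpanded(llvm::ArrayRef<syntax::Token> Expanded,
                   PartialSelectionPolicy Policy = PartialSelectionPolicy::Fail);
```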


================
Comment at: clang/lib/Tooling/Syntax/TokenBuffer.cpp:42
+  auto AddToken = [&](clang::Token T) {
+    if (T.getKind() == tok::raw_identifier && !T.needsCleaning() &&
+        !T.hasUCN()) { // FIXME: support needsCleaning and hasUCN cases.
----------------
this is recognizing keywords, right?
add a comment?
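For reference, the usual idiom for resolving raw identifiers to keywords when lexing without a `Preprocessor` (a sketch of what I believe this code is doing; the local names are illustrative, not from the patch):

```
// Raw lexing produces tok::raw_identifier for identifiers and keywords
// alike; an IdentifierTable seeded with the LangOptions' keywords resolves
// "int", "return", etc. to their keyword token kinds.
clang::IdentifierTable Identifiers(LangOpts);
clang::IdentifierInfo &II = Identifiers.get(T.getRawIdentifier());
T.setIdentifierInfo(&II);
T.setKind(II.getTokenID()); // e.g. tok::kw_int instead of tok::raw_identifier
```

(With a `Preprocessor` around, `PP.LookUpIdentifierInfo(T)` does the same thing.)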


================
Comment at: clang/lib/Tooling/Syntax/TokenBuffer.cpp:70
+    assert(Loc.isFileID());
+    InsideMainFile = SM.getFileID(Loc) == SM.getMainFileID();
+    flushMacroInvocation();
----------------
I suspect this state isn't needed if we use a single TokenBuffer for all files.


Repository:
  rG LLVM Github Monorepo

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D59887/new/

https://reviews.llvm.org/D59887