[PATCH] D59887: [Syntax] Introduce TokenBuffer, start clangToolingSyntax library
Ilya Biryukov via Phabricator via cfe-commits
cfe-commits at lists.llvm.org
Tue Apr 2 01:45:56 PDT 2019
ilya-biryukov added a comment.
Will address the rest of the comments later, answering a few questions that popped up first.
================
Comment at: clang/include/clang/Tooling/Syntax/TokenBuffer.h:72
+ /// macro or a name, arguments and parentheses of a function-like macro.
+ llvm::ArrayRef<syntax::Token> macroTokens(const TokenBuffer &B) const;
+ /// The range covering macroTokens().
----------------
sammccall wrote:
> ilya-biryukov wrote:
> > gribozavr wrote:
> > > `invocationTokens` or `macroInvocationTokens`
> > The plan is to eventually include the macro directives tokens, hence the name.
> > `invocationTokens` are somewhat inaccurate in that case, can't come up with a better name.
> I think your reply applies to TokenBuffer::macroTokens(), not MacroInvocation::macroTokens().
>
> +1 to invocationTokens here.
Yeah, sorry, I have mixed up these two functions with the same name. Will change to `invocationTokens`.
================
Comment at: clang/include/clang/Tooling/Syntax/TokenBuffer.h:1
+//===- TokenBuffer.h - store tokens of preprocessed files -----*- C++ -*-=====//
+//
----------------
sammccall wrote:
> are you sure TokenBuffer is the central concept in this file, rather than just the thing with the most code? Token.h might end up being a better name for users.
I don't mind changing this to `Token.h`, although I'd personally expect that a file with this name only contains a definition for the token class.
`Tokens.h` would be a better fit from my POV. WDYT?
================
Comment at: clang/include/clang/Tooling/Syntax/TokenBuffer.h:92
+
+/// A list of tokens obtained by lexing and preprocessing a text buffer and a
+/// set of helpers to allow mapping the tokens after preprocessing to the
----------------
sammccall wrote:
> Warning, braindump follows - let's discuss further.
> We talked a bunch offline about the logical and concrete data model here.
>
> As things stand:
> - #includes are not expanded, but will refer to a file ID with its own buffer: `map<FileID, TokenBuffer>` is the whole structure
> - no information is captured about other PP interactions (PP directives that generate no tokens, skipped sections
> - the spelled sequence of tokens is not directly available (but expanded + macro invocations may be enough to reconstruct it)
>
> If we keep this model, I think we should spell both of these out in the comment.
> But there's another fairly natural model we could consider:
>
> - we have a single TokenBuffer with a flat list of all expanded tokens in the TU, rather than for the file
> - one FileID corresponds to a contiguous range of *transitive* symbols, these ranges nest
> - the TokenBuffer also stores the original tokens as `map<FileID, vector<Token>>`
> - spelled -> expanded token mapping: spelled tokens for a file are partitioned into ranges (types: literal, include, macro invocation, other pp directive, pp skipped region). each range maps onto a range of expanded tokens (empty for e.g. pp directive or skipped region)
> - expanded -> spelled token is similar. (It's almost the inverse of the of the other mapping, we drop the "include" mapping ranges)
> - This can naturally handle comments, preprocessor directives, pp skipped sections, etc - these are in the spelled stream but not the expanded stream.
>
> e.g. for this code
> ```
> // foo.cpp
> int a;
> #include "foo.h"
> int b = FOO(42);
> // foo.h
> #define FOO(X) X*X
> int c;
> ```
>
> we'd have this buffer:
> ```
> expandedTokens = [int a ; int c ; int b = 42 * 42 ;]
> spelledTokens = {
> "foo.cpp" => [int a; # include "foo.h" int b = FOO ( 42 ) ;],
> "foo.h" => [# define FOO ( X ) X * X int c ;]
> }
>
> expandedToSpelling = {
> int => int (foo.cpp), type = literal
> a => a
> ; => ;
>
> [] => [# define FOO ( X ) X * X](foo.h), type=preprocessor directive
> int => int (foo.h)
> c => c
> ; => ;
>
> int => int (foo.cpp)
> b => b
> = => =
> [42 * 42] => [FOO ( 42 ) ](foo.cpp), type=macro invocation
> ; => ; (foo.cpp)
> }
>
> spellingToExpanded = {
> // foo.cpp
> int => int, type=literal
> a => a
> ; => ;
> [# include "foo.h"] => [int c ;], type=include
> int => int
> b => b
> = => =
> [FOO ( X )] => [42 * 42], type=macro invocation
> ; => ;
>
> // foo.h
> [# define FOO ( X ) X] => [], type=preprocessor directive
> int => int
> c => c
> ; => ;
> }
> ```
>
> Various implementation strategies possible here, one obvious one is to use a flat sorted list, and include a sequence of literal tokens as a single entry.
>
> ```
> struct ExpandedSpellingMapping {
> unsigned ExpandedStart, ExpandedEnd;
> FileID SpellingFile; // redundant for illustration: can actually derive from SpellingTokens[SpellingStart].location()
> unsigned SpellingStart, SpellingEnd;
> enum { LiteralSequence, MacroInvocation, Preprocessor, PPSkipped, Inclusion } Kind;
> }
> vector<ExpandedSpellingMapping> ExpandedToSpelling; // bsearchable
> vector<pair<FileID, ExpandedSpellingMapping>> Inclusions; // for spelling -> expanded mapping. redundant: much can be taken from SourceManager.
> ```
>
> A delta-encoded representation with chunks for bsearch will likely be much more compact, if this proves large.
> (Kind of similar to the compressed posting lists in clangd Dex)
>
> But as-is, the mappings for the example file would be:
> ```
> Expanded = {
> {0, 3, File0, 0, 3, LiteralSequence}, // int a;
> {3, 3, File1, 0, 8, Preprocessor}, // #define FOO(X) X * X
> {3, 6, File1, 8, 11, LiteralSequence}, // int c;
> {6, 9, File0, 6, 9, LiteralSequence}, // int b =
> {9, 12, File0, 9, 13, MacroExpansion}, // FOO(42)
> {12, 13, File0, 13, 14, LiteralSequence}, // ;
> }
> Inclusions = {
> {File1, {3, 6, File0, 3, 6, Inclusion}}, // #include
> }
> ```
Thanks for raising the multi-file issue earlier rather than later. My original intention was to model `TokenBuffer` as a stream of expanded tokens **for a single FileID**, i.e. we would have multiple `TokenBuffers` for multiple files.
Your suggestion of having a single set of expanded tokens and a map from FileID to vectors of raw tokens (i.e. before preprocessing) looks very nice. A single expanded token stream is definitely a better model for the syntax trees (each node of a tree would cover a continuous subrange of the expanded token stream).
In short, here are the highlights of the proposed model that I find most notable are:
- store a **single** stream of expanded tokens for a TU.
- store raw tokens for each file (mapped either by FileID or FileEntry).
- store information about preprocessor directives and macro replacements in each FileID (i.e. `MacroInvocation` in this patch)
- support an operation of mapping a subrange of expanded tokens into the subrange of raw-tokens in a particular FileID. This operation might fail.
Let me know if I missed something.
================
Comment at: clang/include/clang/Tooling/Syntax/TokenBuffer.h:146
+ llvm::Optional<std::pair<unsigned, unsigned>>
+ toOffsetRange(const Token *Begin, const Token *End,
+ const SourceManager &SM) const;
----------------
sammccall wrote:
> I'd think this would map a character range onto a character range, or a token range onto a token range - is there some reason otherwise?
No particular reason, this was driven by its usage in function computing replacements (not in this patch, see [[ https://github.com/ilya-biryukov/llvm-project/blob/syntax/clang/lib/Tooling/Syntax/ComputeReplacements.cpp | github prototype ]], if interested, but keep in mind the prototype is not ready for review yet).
Mapping to a range of "raw" tokens seems more consistent with our model, will update accordingly, obtaining a range is easy after one has tokens from the file.
================
Comment at: clang/include/clang/Tooling/Syntax/TokenBuffer.h:171
+ /// #pragma, etc.
+ llvm::ArrayRef<syntax::Token> macroTokens() const { return MacroTokens; }
+
----------------
sammccall wrote:
> not sure quite what this is for, but "tokens not in expanded stream" might include header-guarded #includes, comments, ifdef-'d sections, #defines...
> is this API intended to be a whitelist (macro invocations specifically) or a blacklist (everything that's not literally in the expanded stream)
This was supposed to be all **raw** tokens in a file, which are not part of the expanded token stream, i.e. all PP directives, macro-replaced-tokens (object-like and function-like macros), tokens of skipped else branches, etc.
In your proposed model we would store raw tokens of a file separately and this would not be necessary.
Repository:
rG LLVM Github Monorepo
CHANGES SINCE LAST ACTION
https://reviews.llvm.org/D59887/new/
https://reviews.llvm.org/D59887
More information about the cfe-commits
mailing list