[PATCH] D119162: [Pseudo] Token/TokenStream, PP directive parser.

Wed Feb 16 03:25:14 PST 2022

hokein added a comment.

Thanks, this looks good in general.

I only reviewed the token part, but don't want to block the review. I think it might be nice to split out the preprocessor part from this patch, this would make the scope clearer and review efficient (sorry, quite busy with other stuff this week).

================
Comment at: clang/include/clang/Tooling/Syntax/Pseudo/Preprocess.h:103
+// FIXME: This approximates std::variant<Code, Directive, Conditional>.
+//         Switch once we can use C++17.
+class PPStructure::Chunk {
----------------
yeah, this should be happened soon, see https://discourse.llvm.org/t/rfc-increasing-the-gcc-and-clang-requirements-to-support-c-17-in-llvm/59983.

================
Comment at: clang/include/clang/Tooling/Syntax/Pseudo/Token.h:60
+  StringRef text() const { return StringRef(Data, Length); }
+  const char *Data;
+  uint32_t Length;
----------------
nit: add default init value.

================
Comment at: clang/include/clang/Tooling/Syntax/Pseudo/Token.h:64
+  /// Zero-based line number.
+  uint32_t Line = 0;
+  /// Width of whitespace before the first token on this line.
----------------
nit: mention this is the line number for the start of token, a token can be multiple-line.

I think the meaning of line number is token-stream-dependent, right? For a lex token-stream, it is raw line number; For a cook token stream, it might take the backslash continued line into account?

================
Comment at: clang/include/clang/Tooling/Syntax/Pseudo/Token.h:89
+  uint32_t size() const { return End - Begin; }
+  static Range empty(unsigned Index) { return Range{Index, Index}; }
+};
----------------
nit: unsigned -> Token::Index

I'd probably drop it (looks like only one place using it, inclining it should be fine), the name empty is usually a method checking the Range is empty rather than creating an empty Range.

================
Comment at: clang/include/clang/Tooling/Syntax/Pseudo/Token.h:128
+  /// T must be within the stream or the end sentinel (not the start sentinel).
+  Token::Index index(const Token &T) const {
+    assert(&T != Storage.data() && "start sentinel");
----------------
nit: isFinalized assert?

================
Comment at: clang/include/clang/Tooling/Syntax/Pseudo/Token.h:130
+    assert(&T != Storage.data() && "start sentinel");
+    assert(&T >= Storage.data() && &T < Storage.data() + Storage.size());
+    return &T - Tokens.data();
----------------
`>=` => `>`, since we have checked above.

================
Comment at: clang/include/clang/Tooling/Syntax/Pseudo/Token.h:143
+  /// May return the end sentinel if the stream is empty.
+  Token &front() { return Storage[1]; }
+  const Token &front() const { return Storage[1]; }
----------------
the mutated version seems not used?

================
Comment at: clang/include/clang/Tooling/Syntax/Pseudo/Token.h:144
+  Token &front() { return Storage[1]; }
+  const Token &front() const { return Storage[1]; }
+
----------------
`assert(isFinalized())`

================
Comment at: clang/include/clang/Tooling/Syntax/Pseudo/Token.h:155
+  MutableArrayRef<Token> Tokens;
+  std::vector<Token> Storage;
+};
----------------
I think it is probably good/clearer to have a comment like "Storage = eof + Tokens + eof" around here.

================
Comment at: clang/include/clang/Tooling/Syntax/Pseudo/Token.h:163
+///
+/// Tokens containing trigraps, escaped newlines, UCNs etc will have text() that
+/// reflect this, and will have NeedsCleaning set.
----------------
 trigraps => trigraphs

================
Comment at: clang/include/clang/Tooling/Syntax/Pseudo/Token.h:166
+/// Tokens at the start of an unescaped newline (where a directive may start)
+/// will have StartsPPLine.
+/// "word-like" tokens such as identifiers and keywords will be raw_identifier.
----------------
I think it would be clearer to give more details (for example, give some examples) in the comment to clarify -- it took me quite a while to understand "an unescaped newline" (I guess this refers to the backslash continued line?).

And having a PP in the name seems a little confusing, it somehow indicates this flag  is PP-specific, e.g. for a simple example without any PP structure `int a;` the `int` token has the  `StartsPPLine` being set.

================
Comment at: clang/include/clang/Tooling/Syntax/Pseudo/Token.h:171
+  NeedsCleaning = 1 << 0,
+  StartsPPLine = 1 << 1,
+};
----------------
I was going to say "add some comments around these enums", then realized that there were comments for them in the above `lex()` comment, I'd  move those comments to here.

================
Comment at: clang/lib/Tooling/Syntax/Pseudo/Lex.cpp:65
+
+    if (CT.isAtStartOfLine())
+      Tok.setFlag(LexFlags::StartsPPLine);
----------------
just want to double-check this is the behavior we want,

>From the clang's token comment:
```
TokenFlags::StartOfLine  // At start of line or only after whitespace
````

so the `abc` token will have this flag
```
        abc // abc has two leading spaces.

================
Comment at: clang/lib/Tooling/Syntax/Pseudo/Lex.cpp:66
+    if (CT.isAtStartOfLine())
+      Tok.setFlag(LexFlags::StartsPPLine);
+    if (CT.needsCleaning() || CT.hasUCN())
----------------
If I read the code correctly, StartsPPLine is not PP-specific, the flag will be set for all tokens where they are at start of the line. Having a PP in the flag name seems a little confusing.

```
abc // abc token will have StartsPPLine set
```

================
Comment at: clang/lib/Tooling/Syntax/Pseudo/Lex.cpp:88
+      while (Pos < Tok.text().end()) {
+        unsigned CharSize;
+        CleanBuffer.push_back(
----------------
nit: = 0

================
Comment at: clang/unittests/Tooling/Syntax/Pseudo/TokenTest.cpp:108
+                                hasFlag(LexFlags::StartsPPLine),
+                                hasFlag(LexFlags::NeedsCleaning)),
+                          AllOf(token("two", tok::raw_identifier),
----------------
can we add some line&indentation verifications here?

================
Comment at: clang/unittests/Tooling/Syntax/Pseudo/TokenTest.cpp:117
+  TokenStream Cooked = cook(Raw, Opts);
+  EXPECT_THAT(Cooked.tokens(), ElementsAre(token("one_token", tok::identifier),
+                                           token("two", tok::identifier),
----------------
and here as well, line number & indentation.

================
Comment at: clang/unittests/Tooling/Syntax/Pseudo/TokenTest.cpp:165
+  EXPECT_THAT(Raw.tokens(), ElementsAreArray({
+                                lineIndent(0, 3),
+                                lineIndent(0, 3),
----------------
nit: add trailing comments for each lineIndent, e.g. `// hello`

================
Comment at: clang/unittests/Tooling/Syntax/Pseudo/TokenTest.cpp:168
+                                lineIndent(1, 0),
+                                lineIndent(2, 0),
+                            }));
----------------
why the the indentation is 0? 

Repository:
  rG LLVM Github Monorepo

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D119162/new/

https://reviews.llvm.org/D119162