[PATCH] D119162: [Pseudo] Token/TokenStream, PP directive parser.

Sam McCall via Phabricator via cfe-commits cfe-commits at lists.llvm.org
Mon Feb 21 11:13:24 PST 2022


sammccall added inline comments.


================
Comment at: clang/include/clang/Tooling/Syntax/Pseudo/Preprocess.h:49
+/// | `                                printf("hello, ");
+/// |-+ Conditional -+ Directive     #ifndef NDEBUG
+/// | |-+ Code                         printf("debug\n");
----------------
hokein wrote:
> is the `-+ Directive` expected?
Yes, this is `Cond.Branches[0].first`.


================
Comment at: clang/include/clang/Tooling/Syntax/Pseudo/Preprocess.h:103
+// FIXME: This approximates std::variant<Code, Directive, Conditional>.
+//         Switch once we can use C++17.
+class PPStructure::Chunk {
----------------
hokein wrote:
> yeah, this should be happened soon, see https://discourse.llvm.org/t/rfc-increasing-the-gcc-and-clang-requirements-to-support-c-17-in-llvm/59983.
I hope so, but I'm not holding my breath.
(It'll happen soon enough that I'm not going to worry about optimizing this though)


================
Comment at: clang/include/clang/Tooling/Syntax/Pseudo/Token.h:64
+  /// Zero-based line number.
+  uint32_t Line = 0;
+  /// Width of whitespace before the first token on this line.
----------------
hokein wrote:
> nit: mention this is the line number for the start of token, a token can be multiple-line.
> 
> I think the meaning of line number is token-stream-dependent, right? For a lex token-stream, it is raw line number; For a cook token stream, it might take the backslash continued line into account?
> 
> mention this is the line number for the start of token
Done.

> I think the meaning of line number is token-stream-dependent, right? 
This always refers to the original source code - a backslash continued line won't reduce the line number. Expanded the comment.



================
Comment at: clang/include/clang/Tooling/Syntax/Pseudo/Token.h:89
+  uint32_t size() const { return End - Begin; }
+  static Range empty(unsigned Index) { return Range{Index, Index}; }
+};
----------------
hokein wrote:
> nit: unsigned -> Token::Index
> 
> I'd probably drop it (looks like only one place using it, inclining it should be fine), the name empty is usually a method checking the Range is empty rather than creating an empty Range.
This will be used in a bunch more places, and inlining it is awkward:
 - often have to extract an expression to a variable to avoid duplicating it
 - it's IMO much less obvious at the callsite the intent is to specify an empty range

You're right about the name though, renamed to emptyAt which hopefully avoids that association.


================
Comment at: clang/include/clang/Tooling/Syntax/Pseudo/Token.h:130
+    assert(&T != Storage.data() && "start sentinel");
+    assert(&T >= Storage.data() && &T < Storage.data() + Storage.size());
+    return &T - Tokens.data();
----------------
hokein wrote:
> `>=` => `>`, since we have checked above.
I'm not sure why that change would be an improvement, can you elaborate?

The two assertions are that we own the token, and that it's not the start sentinel. I think changing the condition makes that less clear.

(Reordered them though, as the start-sentinel assertion seems secondary)


================
Comment at: clang/include/clang/Tooling/Syntax/Pseudo/Token.h:166
+/// Tokens at the start of an unescaped newline (where a directive may start)
+/// will have StartsPPLine.
+/// "word-like" tokens such as identifiers and keywords will be raw_identifier.
----------------
hokein wrote:
> I think it would be clearer to give more details (for example, give some examples) in the comment to clarify -- it took me quite a while to understand "an unescaped newline" (I guess this refers to the backslash continued line?).
> 
> And having a PP in the name seems a little confusing, it somehow indicates this flag  is PP-specific, e.g. for a simple example without any PP structure `int a;` the `int` token has the  `StartsPPLine` being set.
Oops, this sentence didn't make much sense. The point is that the preprocessor distinguishes between logical and physical lines:

```
#define X(error) \
  #error // this is not a directive
auto m = X(foo); // string "foo"
```

This concept of "logical line" and the rules around it are specific to the preprocessor. The rest of the parser is line-agnostic or uses physical line numbers. So I think PP belongs in the name.

Expanded the comment here.


================
Comment at: clang/lib/Tooling/Syntax/Pseudo/Lex.cpp:65
+
+    if (CT.isAtStartOfLine())
+      Tok.setFlag(LexFlags::StartsPPLine);
----------------
hokein wrote:
> just want to double-check this is the behavior we want,
> 
> From the clang's token comment:
> ```
> TokenFlags::StartOfLine  // At start of line or only after whitespace
> ````
> 
> so the `abc` token will have this flag
> ```
>         abc // abc has two leading spaces.
Yes. This matches the logic clang uses (Lexer.cpp:657) to decide whether a hash starts a PP directive, which is what this flag is intended to emulate.


================
Comment at: clang/lib/Tooling/Syntax/Pseudo/Lex.cpp:66
+    if (CT.isAtStartOfLine())
+      Tok.setFlag(LexFlags::StartsPPLine);
+    if (CT.needsCleaning() || CT.hasUCN())
----------------
hokein wrote:
> If I read the code correctly, StartsPPLine is not PP-specific, the flag will be set for all tokens where they are at start of the line. Having a PP in the flag name seems a little confusing.
> 
> ```
> abc // abc token will have StartsPPLine set
> ```
the semantics are PP-specific, only the preprocessor cares about this unusual definition of "line".

In your example, `abc // abc token...` is indeed a preprocessor line (though not a directive). In particular:
```
#define foo
abc // abc token...
```

The `#define` directive is terminated by the `abc` token precisely because it starts a new PP logical line.


================
Comment at: clang/lib/Tooling/Syntax/Pseudo/Preprocess.cpp:45
+  // Parses tokens starting at Tok into PP.
+  // If we reach an #end or #else directive that ends PP, returns it.
+  // If TopLevel is true, then we do not expect #end and always return None.
----------------
hokein wrote:
> #end => #endif.
Changed these to End etc instead.
While `#endif` is the only End directive, that's incidental (and not the case for If or Else)


================
Comment at: clang/lib/Tooling/Syntax/Pseudo/Preprocess.cpp:145
+
+static llvm::StringLiteral ppKeywordName(tok::PPKeywordKind kind) {
+  switch (kind) {
----------------
hokein wrote:
> hokein wrote:
> > nit: Kind
> looks like this function can be lifted to TokenKinds.h file (there is a similar getTokenName there), maybe a FIXME?
Makes sense, moved it in this patch.


================
Comment at: clang/unittests/Tooling/Syntax/Pseudo/TokenTest.cpp:168
+                                lineIndent(1, 0),
+                                lineIndent(2, 0),
+                            }));
----------------
hokein wrote:
> why the the indentation is 0? 
It was reusing the indentation from the continued line. I'm not really sure why I thought this was a good idea though. Removed this logic, indentation is now 2.


Repository:
  rG LLVM Github Monorepo

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D119162/new/

https://reviews.llvm.org/D119162



More information about the cfe-commits mailing list