[PATCH] D88299: [clang-format] Add MacroUnexpander.

Fri Nov 26 10:26:30 PST 2021

klimek added a comment.

Noticed I should have waiting with the renaming of the file until the review is done :( Sorry for the extra confusion.

================
Comment at: clang/lib/Format/MacroUnexpander.cpp:77
+  // stream, we need to continue the unexpansion until we find the right token
+  // if it is part of the unexpanded token stream.
+  // Note that hidden tokens can be part of the unexpanded stream in nested
----------------
sammccall wrote:
> I can't work out what "it" refers to in this sentence.
> 
> (and "spelled" token stream?)
Changed to "given token" - it refers to the token.

It's reconstructed, not spelled, like the example below explains: we do have hidden tokens that we want to find when we're in the middle of a recursive expansion. I wouldn't call the hidden tokens here "spelled" - would you?

================
Comment at: clang/lib/Format/MacroUnexpander.cpp:82
+  // And the call: BRACED(BRACED(x))
+  // The outer macro call will be BRACED({a}), and the hidden tokens '{' and
+  // '}' can be found in the unexpanded macro stream of that level.
----------------
sammccall wrote:
> It would be helpful to complete the example by spelling out which token you're adding, which the correct parent is, and which tokens you need to "expand over" to make it available.
> 
> I *think* the answer to the first two is - when you're adding the `a` then its proper parent is the inner `(` in `BRACED(BRACED(`... but I don't know the last part.
Found out that I was missing a unit test. Added a unit test, and now explained the unit test here in the comment. PTAL.

================
Comment at: clang/lib/Format/MacroUnexpander.cpp:84
+  // '}' can be found in the unexpanded macro stream of that level.
+  if (!Unexpanded.empty() && Token->MacroCtx &&
+      (Token->MacroCtx->Role != MR_Hidden ||
----------------
sammccall wrote:
> is it possible that you may need to unexpand more than the innermost macro?
> 
> e.g. BRACED(ID(ID(BRACED(ID(ID(a)))))) expands to {{a}} and the parents of `a` and inner `{` each come from a couple of macro-levels up.
> 
> (I don't totally understand the logic here, so the answer's probably "no"...)
A token on a higher level must always also be there on a lower level.
Calls in this example are:
BRACED({a})
ID({a})
ID({a})
BRACED(a)
ID(a)
ID(a)
When the next token comes in, we thus always find it in the higher level (open) macro call. When we reconstruct for that token, we then reconstruct multiple macro calls at once, if it is the first token in multiple macro calls.

================
Comment at: clang/lib/Format/MacroUnexpander.cpp:173
+  assert(Token->MacroCtx);
+  // A single token can be the only result of a macro call:
+  // Given: ID(x, y) ;
----------------
sammccall wrote:
> This raises another point: a macro can have an empty body.
> AFAICT this fundamentally isn't supported here, as we're driven by the expanded token stream.
> I guess it makes sense to handle this elsewhere in clang-format (or even not handle it) but it should be documented somewhere.
I think the right point to document and handle it is at the next layer, where we integrate all of this into clang-format.

================
Comment at: clang/lib/Format/MacroUnexpander.cpp:238
+  // generated by the call.
+  for (size_t I = Unexpanded.size(); I < Token->MacroCtx->ExpandedFrom.size();
+       ++I) {
----------------
sammccall wrote:
> nit: index arithmetic obscures what's going on a bit.
> You could write
> 
> ```
> ArrayRef<FormatToken *> StartedMacros = makeArrayRef(Token->MacroCtx->ExpandedFrom).drop_front(Unexpanded.size());
> for (FormatToken *ID : llvm::reverse(StartedMacros)) {
> ...
> }
> ```
> but up to you
> 
> It's not obvious to me *why* we're iterating in reverse order BTW: i.e. why the order of the `Unexpanded` stack is opposite the order of the `ExpandedFrom` stack.
> So maybe just a comment to reinforce that, like (// ExpandedFrom is outside-in order, we want inside-out)
Oooh, this is nice. s/drop_front/drop_back/. Added comment.

================
Comment at: clang/lib/Format/MacroUnexpander.cpp:282
+         (Unexpanded.size() >= Token->MacroCtx->EndOfExpansion));
+  if (!Token->MacroCtx)
+    return;
----------------
sammccall wrote:
> nit: Token->MacroCtx is checked at the callsite, and startUnexpansion asserts it as a precondition - do that here too for symmetry?
Thanks for spotting, this was from a time where the code looked differently (the assert above also didn't make sense any more in that form).

================
Comment at: clang/lib/Format/MacroUnexpander.cpp:310
+    // Expand the closing parenthesis, if it exists, including an optional
+    // trailing comment.
+    for (auto T = Unexpanded.back().I; T != Unexpanded.back().End; ++T) {
----------------
sammccall wrote:
> I can't work out if this is supposed to say comma or comment :-)
> 
> If comma - is that a thing?
> If comment - why would a comment be considered part of the unexpansion of a macro invocation, rather than a sibling to the macro? How do we know what trails is a comment - should we have an assertion?
Added words.

Re: the trailing comment, that's an idiosyncrasy of who clang-format handles trailing comments - it parses them with the token preceding them. We could probably spend more complexity on the call side, trying to break them out, but I'm not sure that's better, given that this point already needs to be fairly reliable against all user code.

================
Comment at: clang/lib/Format/MacroUnexpander.cpp:348
+    if (Token->is(tok::comma)) {
+      TokenToParentInOutput
+          [MacroCallStructure.back().Line->Tokens.back()->Tok] = Token;
----------------
sammccall wrote:
> This line is both subtle and cryptic :-).
> ```
> // New unwrapped-lines from unexpansion are normally "chained" - treated as
> // children of the last token of the previous line. 
> // Redirect them to be treated as children of the comma instead.
> // (only affects lines added before we push more tokens into MacroStructure.Line)
> ```
This is partially captured in the comment above. Added a shorter description here, let me know if that's not enough.

================
Comment at: clang/lib/Format/MacroUnexpander.cpp:366
+  }
+  if (!Token->MacroCtx && Token->is(tok::l_paren)) {
+    MacroCallStructure.push_back(
----------------
sammccall wrote:
> This feels like a stupid question, but how do we know that this non-macro-parenthesis has anything to do with macros?
> (Fairly sure the answer is "processNextUnexpanded() is only called when the token is macro-related", it'd be nice to have this precondition spelled out somewhere)
Added a comment above the restructured !Token->MacroCtx check.

================
Comment at: clang/lib/Format/MacroUnexpander.cpp:367
+  if (!Token->MacroCtx && Token->is(tok::l_paren)) {
+    MacroCallStructure.push_back(
+        {currentLine(), parentLine().Tokens.back()->Tok, Token});
----------------
sammccall wrote:
> please assign to the struct fields or use /*Foo=*/ comments
> 
> All the data flow around this struct is local but the order of fields isn't :-)
Added a constructor.

================
Comment at: clang/lib/Format/MacroUnexpander.cpp:369
+        {currentLine(), parentLine().Tokens.back()->Tok, Token});
+    TokenToParentInOutput[MacroCallStructure.back().RedirectParentFrom] = Token;
+    pushToken(Token);
----------------
sammccall wrote:
> nit: = MacroCallStructure.back().RedirectParentTo
> This line + line 361 are the keys to understanding what the RedirectParentFrom/To do, so it'd be helpful for them to explicitly use those variables.
> 
> (or use the source expressions for both... but in that case the fields may need docs)
1. Line 361 was not actually necessary; I first thought that was a bug, but then dug in, and noticed that we weren't able to hit that code path (anymore, after restructuring)

2. Completely renamed the members and documented them, added some more asserts to make things hopefully clearer - thanks for pointing it out, those were clearly underdocumented; I hope it's now more clear.

================
Comment at: clang/lib/Format/MacroUnexpander.cpp:406
+  for (auto *E = Output.Tokens.front()->Children.end(); I != E; ++I) {
+    if ((*I)->Tokens.empty())
+      continue;
----------------
sammccall wrote:
> hmm, when is this possible? are these literal blank lines? where do they come from?
Looks like you're right; this looks like it was left over from a previous different structure, I'll make sure to fuzz it thoroughly.

================
Comment at: clang/lib/Format/MacroUnexpander.cpp:409
+
+    // Every subsequent line will become a child of the last token in the
+    // previous line, which is the token prior to the first token in the line.
----------------
sammccall wrote:
> Isn't this just the end of line from the previous iteration of this loop? Why the global map populated by pushToken?
> (I feel like i'm missing something obvious here)
Nice catch, the complexity here was also from a previous iteration where the structure of the algorithm was (even) more complex, where we needed to also do stitching for non-top-level lines. Now able to completely get rid of PreviousToken and TokenToPrevious \o/

================
Comment at: clang/lib/Format/Macros.h:30
 ///
 /// When formatting, clang-format formats the expanded unwrapped lines first,
 /// determining the token types. Next, it formats the spelled unwrapped lines,
----------------
sammccall wrote:
> would s/formats/annotates/ be inaccurate?
> 
> This is just my poor understanding of the code, but it wasn't obvious to me that annotation is closely associated with formatting and not with parsing UnwrappedLines.
Said "annotates and formats" now - yes, the fact that annotating and formatting is so coupled is a admittedly weird choice of the initial design.

================
Comment at: clang/lib/Format/Macros.h:135
+/// Converts a stream of unwrapped lines of expanded tokens into unwrapped lines
+/// containing the tokens of the macro call in a way that the resulting
+/// unwrapped lines of the macro call best resemble the split of unwrapped lines
----------------
sammccall wrote:
> I know it's the common case, but I think saying "the macro call" here is misleading, because it quickly becomes apparent reading the code that the scope *isn't* one macro call, and (at least for me) it's easy to get hung up on not understanding what the scope is. (AIUI the scope is actually one UL of *output*... so the use of plural there is also confusing).
> 
> I think a  escription could be something like:
> 
> Converts a sequence of UnwrappedLines containing expanded macros into a single UnwrappedLine containing the macro calls.
> This UnwrappedLine may be broken into child lines, in a way that best conveys the structure of the expanded code.
> ...
> In the simplest case, a spelled UnwrappedLine contains one macro, and after expanding it we have one expanded UnwrappedLine.
> In general, macro expansions can span UnwrappedLines, and multiple macros can contribute tokens to the same line.
> We keep consuming expanded lines until:
>  - all expansions that started have finished (we're not chopping any macros in half)
>  - *and* we've reached the end of a *spelled* unwrapped line
> A single UnwrappedLine represents this chunk of code.
> 
> After this point, the state of the spelled/expanded stream is "in sync" (both at the start of an UnwrappedLine, with no macros open), so the Unexpander can be thrown away and parsing can continue.
> 
> (and then launch into an example)
> 
Thanks, that's a really good write-up!

================
Comment at: clang/lib/Format/Macros.h:192
+  /// token.
+  UnwrappedLine getResult();
+
----------------
sammccall wrote:
> nit: giving `getResult()` a side-effect but also making it idempotent is a bit clever/confusing.
> 
> Either exposing `void finalize();` + `UnwrappedLine get() const`, or `UnwrappedLine takeResult() &&`, seems a little more obvious.
Done.

================
Comment at: clang/lib/Format/Macros.h:218
+  enum UnexpanderState {
+    Start,      // No macro expansion was found in the input yet.
+    InProgress, // During a macro unexpansion.
----------------
sammccall wrote:
> you have this "no expansions are open, but we already didn't find any" state.
> The effect of this is that finished() returns false so the caller keeps looping.
> 
> But a correct caller will never rely on this:
>  - the first line a caller feeds must have macro tokens in it, or our output will be garbage AFAICT
>  - calling getResult() without feeding in any lines is definitely not correct
> 
> It seems we could rather assert on these two conditions, and eliminate the Start/InProgress distinction.
> That way incorrect usage is an assertion crash, rather than turning into an infinite loop.
Changed to asserts.

================
Comment at: clang/lib/Format/Macros.h:278
+  // once we're past the comma in the unexpansion.
+  llvm::DenseMap<FormatToken *, FormatToken *> TokenToParentInOutput;
+
----------------
sammccall wrote:
> ParentInExpandedToParentInUnexpanded?
> 
> (current name implies that it maps a token to *its* parent. It also uses the input/output names, rather than expanded/unexpanded - it would be nice to be consistent)
Called it SpelledParentToReconstructedParent.

================
Comment at: clang/lib/Format/Macros.h:286
+    // Our current position in the unexpansion.
+    std::list<UnwrappedLineNode>::iterator I;
+    // The end of the unexpanded token sequence.
----------------
sammccall wrote:
> consider calling these NextSpelled and EndSpelled to be really explicit?
> Since the type doesn't really clarify which sequence is being referred to.
Called them SpelledI and SpelledE in llvm tradition.

================
Comment at: clang/lib/Format/Macros.h:314
+  // \- )
+  llvm::SmallVector<MacroCallState, 4> MacroCallStructure;
+
----------------
sammccall wrote:
> nit: if you're going to specify SmallVector sizes, I don't understand why you'd size Unexpanded vs MacroCallStructure differently - they're going to be the same size, right?
> 
> These days you can leave off the size entirely though, and have it pick a value
I did not know that, that's awesome!

================
Comment at: clang/lib/Format/Macros.h:142
+/// When getting the formatted lines of the expansion via the \c addLine method:
+/// -> class A {
+/// -> public:
----------------
sammccall wrote:
> klimek wrote:
> > klimek wrote:
> > > sammccall wrote:
> > > > sammccall wrote:
> > > > > I'm a bit confused by these arrows. It doesn't seem that they each point to an unwrappedline passed to addLine?
> > > > This example didn't really help me understand the interface of this class, it seems to be a special case:
> > > >  - the input is a single block construct (rather than e.g. a whole program), but it's not clear why that's the case.
> > > >  - the output (unexpanded form) consists of exactly a macro call with no leading/trailing tokens, which isn't true in general
> > > > 
> > > > If the idea is to provide as input the minimal range of lines that span the macro, we should say so somewhere. But I would like to understand why we're not simply processing the whole file.
> > > Why not? That is the intention? Note that high-level we do not pass class definitions in one unwrapped line; if they ever go into a single line it's done in the end by a special merging pass (yes, that's a bit confusing).
> > Re: input is a single construct - that is not the case (see other comment)
> > Re: leading / trailing tokens: I didn't want to make it too complex in the example.
> > Re: input is a single construct - that is not the case (see other comment)
> 
> A class is definitely a single construct :-) It sounds like that's not significant to the MacroUnexpander though, so it feels like a red herring to me.
> 
> > Re: leading / trailing tokens: I didn't want to make it too complex in the example.
> That seems fine, I think the complexities of the general case need to be mentioned somewhere because the API reflects them. But you're right, the primary example should be simple.
> I think a tricky example (maybe the `#define M ; x++` one?) could be given on one of addLine/finish/getResult maybe.
> 
#define M ; x++ seems to be similarly tricky to #define CLASSA(x) class A x
We get multiple calls to addLine, in between which finished() is false.

================
Comment at: clang/unittests/Format/MacroUnexpanderTest.cpp:207
+  Unexp.addLine(line(E.consume(lex("};"))));
+  EXPECT_TRUE(Unexp.finished());
+  Matcher U(Call);
----------------
sammccall wrote:
> klimek wrote:
> > sammccall wrote:
> > > you have lots of assertions that this is true, but none that it is false
> > Yeah, that's a white-box assertion - finished() is false the vast majority of the time, so testing the true cases is the important part - happy to add tests if you think it's worth it.
> Yes - a basic test that it's not always true would be useful I think (maybe the `#define M ;x++` case would be useful for showing the expected loop and finished() values)
We got a couple of tests like that, added checks for negative finished in between.

Repository:
  rG LLVM Github Monorepo

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D88299/new/

https://reviews.llvm.org/D88299