[PATCH] D138861: [Clang] Implement CWG2640 Allow more characters in an n-char sequence

Mon Dec 12 09:02:45 PST 2022

tahonermann accepted this revision.
tahonermann added a comment.
This revision is now accepted and ready to land.

This looks good. I'm accepting the change despite one remaining issue that was neither introduced nor addressed by this patch; the diagnostic caret will still be in the wrong location for many of the diagnostics issued by `tryReadNamedUCN()`. Most of them use `StartPtr` which we've established points to `N` instead of `\`. If you want to make an additional update to address that, I'll be happy to review again. I understand if you would rather that be fixed via a separate change.

================
Comment at: clang/lib/Lex/Lexer.cpp:3323
     if (Diagnose)
       Diag(StartPtr, diag::warn_ucn_escape_incomplete);
     return llvm::None;
----------------
tahonermann wrote:
> I was able to confirm that `StartPtr` does point to `N`. See the diagnostic generated at https://godbolt.org/z/qnajcMeso; the diagnostic caret points to `N` instead of `\`.
>   <source>:1:2: warning: incomplete delimited universal character name; treating as '\' 'N' '{' identifier [-Wunicode]
>   \N{abc
>    ^
I think this still needs to be addressed. `tryReadNumericUCN()` handles this case by requiring that callers pass the location of the `\` as `SlashLoc`. I guess this is technically required since the `\` character could be written as either `\` or the `??/` trigraph.

================
Comment at: clang/lib/Lex/Lexer.cpp:3386
+
+  if (!Result || CurPtr - StartPtr == (ptrdiff_t)(Buffer.size() + 4))
     StartPtr = CurPtr;
----------------
cor3ntin wrote:
> tahonermann wrote:
> > cor3ntin wrote:
> > > tahonermann wrote:
> > > > cor3ntin wrote:
> > > > > cor3ntin wrote:
> > > > > > tahonermann wrote:
> > > > > > > This is a bit of a tangent, but under what circumstances would `CurPtr - StartPtr` not be equal to `Buffer.size() + 4`? Actually, I'm not sure that +4 is correct. It looks like `StartPtr` is expected to point to `N` at the beginning of the function, so the initial `\` is not included in the range. If `N` isn't seen, the function returns early. Likewise, if either of the `{` or `}` delimiters is not found, the function returns early.
> > > > > > > 
> > > > > > > I think the goal of this code is to skip the `getAndAdvanceChar()` machinery when the buffer is being claimed (no need to parse UCNs or trigraphs in the character name), but it looks like, particularly with this DR, that we might always be able to do that.
> > > > > > afaict, 4 is correct here because we are one past-the-end.
> > > > > > I do however agree that this whole pattern which is used a few times is probably unnecessary, i do think it would be good to investigate. Not in this patch though, imo
> > > > > I looked into it, I'm sure we could improve but not easily, `getAndAdvanceChar` does set some flags on the token in the presence of trigraphs/line splicing and we need those flags to be set, this is the reason we do need to call that method.
> > > > > It's not super efficient but it's such an edge case... I'd rather not touch that now
> > > > My concern is that, as is, the code always takes the `else` branch (except when `Result` is non-null). Assuming a buffer containing "X", the pointers are arranged as follows (where `$` is one past the end).
> > > >   \ N { X } $
> > > >     |   |   `- CurPtr
> > > >     |   `- Buffer
> > > >     `- StartPtr
> > > > `CurPtr - StartPtr` is 4, but `Buffer.size() + 4` is 5 (`Buffer.size()` is 1 in this case).
> > > > 
> > > > I think there might be an easy test to see if this is working as intended. If it isn't, I would expect a diagnostic to be issued if trigraphs are enabled and the buffer contains a trigraph sequence. Something like:
> > > >   \N{LOTUS??>}
> > > I can try to add tests
> > > 
> > > > My concern is that, as is, the code always takes the else branch (except when Result is non-null). 
> > > Yes, the if branch sole purpose is to set some state in Result.
> > > 
> > > At the start of the function, StartPtr points to `\`
> > > And I'll leave a comment, maybe that will clear up future confusions
> > > 
> > > There may be a potential refactor here, which is to have `getAndAdvanceChar` take a `bool & ShouldCleanupToken` parameter instead of a token so that we don't have to do that dance, but it's a bit outside of the scope of this patch...
> > > At the start of the function, StartPtr points to `\`
> > 
> > It doesn't look like it does. The first use of `StartPtr` is at line 3314 and it expects to read `N`:
> >   3314:   char C = getCharAndSize(StartPtr, CharSize);
> >   3315:   assert(C == 'N' && "expected \\N{...}");
> Yes, you are right. There was a bug in \u too, probably has been there for ages.
> It's unfortunately not testable, any incorrect value would call getCharAndSize which is going to do the right thing.
Thank you; the added comment makes this more clear. Basically, we only want to skip the call to `getAndAdvanceChar()` when we're sure the token wasn't spelled with trigraphs or line spaces.

Repository:
  rG LLVM Github Monorepo

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D138861/new/

https://reviews.llvm.org/D138861