[PATCH] D138861: [Clang] Implement CWG2640 Allow more characters in an n-char sequence

Fri Dec 9 15:34:39 PST 2022

cor3ntin added inline comments.

================
Comment at: clang/lib/Lex/Lexer.cpp:3386
+
+  if (!Result || CurPtr - StartPtr == (ptrdiff_t)(Buffer.size() + 4))
     StartPtr = CurPtr;
----------------
tahonermann wrote:
> cor3ntin wrote:
> > tahonermann wrote:
> > > cor3ntin wrote:
> > > > cor3ntin wrote:
> > > > > tahonermann wrote:
> > > > > > This is a bit of a tangent, but under what circumstances would `CurPtr - StartPtr` not be equal to `Buffer.size() + 4`? Actually, I'm not sure that +4 is correct. It looks like `StartPtr` is expected to point to `N` at the beginning of the function, so the initial `\` is not included in the range. If `N` isn't seen, the function returns early. Likewise, if either of the `{` or `}` delimiters is not found, the function returns early.
> > > > > > 
> > > > > > I think the goal of this code is to skip the `getAndAdvanceChar()` machinery when the buffer is being claimed (no need to parse UCNs or trigraphs in the character name), but it looks like, particularly with this DR, that we might always be able to do that.
> > > > > afaict, 4 is correct here because we are one past-the-end.
> > > > > I do however agree that this whole pattern which is used a few times is probably unnecessary, i do think it would be good to investigate. Not in this patch though, imo
> > > > I looked into it, I'm sure we could improve but not easily, `getAndAdvanceChar` does set some flags on the token in the presence of trigraphs/line splicing and we need those flags to be set, this is the reason we do need to call that method.
> > > > It's not super efficient but it's such an edge case... I'd rather not touch that now
> > > My concern is that, as is, the code always takes the `else` branch (except when `Result` is non-null). Assuming a buffer containing "X", the pointers are arranged as follows (where `$` is one past the end).
> > >   \ N { X } $
> > >     |   |   `- CurPtr
> > >     |   `- Buffer
> > >     `- StartPtr
> > > `CurPtr - StartPtr` is 4, but `Buffer.size() + 4` is 5 (`Buffer.size()` is 1 in this case).
> > > 
> > > I think there might be an easy test to see if this is working as intended. If it isn't, I would expect a diagnostic to be issued if trigraphs are enabled and the buffer contains a trigraph sequence. Something like:
> > >   \N{LOTUS??>}
> > I can try to add tests
> > 
> > > My concern is that, as is, the code always takes the else branch (except when Result is non-null). 
> > Yes, the if branch sole purpose is to set some state in Result.
> > 
> > At the start of the function, StartPtr points to `\`
> > And I'll leave a comment, maybe that will clear up future confusions
> > 
> > There may be a potential refactor here, which is to have `getAndAdvanceChar` take a `bool & ShouldCleanupToken` parameter instead of a token so that we don't have to do that dance, but it's a bit outside of the scope of this patch...
> > At the start of the function, StartPtr points to `\`
> 
> It doesn't look like it does. The first use of `StartPtr` is at line 3314 and it expects to read `N`:
>   3314:   char C = getCharAndSize(StartPtr, CharSize);
>   3315:   assert(C == 'N' && "expected \\N{...}");
Yes, you are right. There was a bug in \u too, probably has been there for ages.
It's unfortunately not testable, any incorrect value would call getCharAndSize which is going to do the right thing.

Repository:
  rG LLVM Github Monorepo

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D138861/new/

https://reviews.llvm.org/D138861