[PATCH] D138861: [Clang] Implement CWG2640 Allow more characters in an n-char sequence

Fri Dec 9 11:57:02 PST 2022

cor3ntin added inline comments.

================
Comment at: clang/lib/Lex/Lexer.cpp:3378-3379

-  if (LooseMatch)
+  // If no diagnostic has been emitted yet we do not want to recover here
+  // to make sure this function will be called again and a diagnostic emitted then.
+  if (LooseMatch && Diagnose)
----------------
tahonermann wrote:
> cor3ntin wrote:
> > tahonermann wrote:
> > > I'm having a hard time understanding what this comment is trying to convey. The comment response to Aaron was much more helpful to me. How about this suggested edit?
> > It's not that a diagnostic would be emitted twice, is that it would not be emitted at all.
> > By recovering - ie returning a loose match - we prevent the function to be called again on the same buffer.
> > So if during the first call Diagnose is false (because we were called from a tryXXXX function), then the compilation will continue assuming we were in fact able to parse an identifier and never informs us of an invalid name.
> More evidence of the difficulty I'm having appreciating this comment :)
> 
> Does this sound right?
>   // Only map a loose match to a code point when in a diagnosing state.
>   // If a mapped code point were to be returned in a non-diagnosing state,
>   // token caching would prevent the opportunity to issue a diagnostic when
>   // the token is later used.
> 
> Tangent: It might be worth renaming one of `Res` and `Result`. I keep getting confused due to the similar names. Perhaps rename `Result` to `Token` or `ResultToken`?
I'm not sure caching is exactly right.
There are tentative parses, that should not give false positive answers. I'll try to message a better comment

================
Comment at: clang/lib/Lex/Lexer.cpp:3386
+
+  if (!Result || CurPtr - StartPtr == (ptrdiff_t)(Buffer.size() + 4))
     StartPtr = CurPtr;
----------------
tahonermann wrote:
> cor3ntin wrote:
> > cor3ntin wrote:
> > > tahonermann wrote:
> > > > This is a bit of a tangent, but under what circumstances would `CurPtr - StartPtr` not be equal to `Buffer.size() + 4`? Actually, I'm not sure that +4 is correct. It looks like `StartPtr` is expected to point to `N` at the beginning of the function, so the initial `\` is not included in the range. If `N` isn't seen, the function returns early. Likewise, if either of the `{` or `}` delimiters is not found, the function returns early.
> > > > 
> > > > I think the goal of this code is to skip the `getAndAdvanceChar()` machinery when the buffer is being claimed (no need to parse UCNs or trigraphs in the character name), but it looks like, particularly with this DR, that we might always be able to do that.
> > > afaict, 4 is correct here because we are one past-the-end.
> > > I do however agree that this whole pattern which is used a few times is probably unnecessary, i do think it would be good to investigate. Not in this patch though, imo
> > I looked into it, I'm sure we could improve but not easily, `getAndAdvanceChar` does set some flags on the token in the presence of trigraphs/line splicing and we need those flags to be set, this is the reason we do need to call that method.
> > It's not super efficient but it's such an edge case... I'd rather not touch that now
> My concern is that, as is, the code always takes the `else` branch (except when `Result` is non-null). Assuming a buffer containing "X", the pointers are arranged as follows (where `$` is one past the end).
>   \ N { X } $
>     |   |   `- CurPtr
>     |   `- Buffer
>     `- StartPtr
> `CurPtr - StartPtr` is 4, but `Buffer.size() + 4` is 5 (`Buffer.size()` is 1 in this case).
> 
> I think there might be an easy test to see if this is working as intended. If it isn't, I would expect a diagnostic to be issued if trigraphs are enabled and the buffer contains a trigraph sequence. Something like:
>   \N{LOTUS??>}
I can try to add tests

> My concern is that, as is, the code always takes the else branch (except when Result is non-null). 
Yes, the if branch sole purpose is to set some state in Result.

At the start of the function, StartPtr points to `\`
And I'll leave a comment, maybe that will clear up future confusions

There may be a potential refactor here, which is to have `getAndAdvanceChar` take a `bool & ShouldCleanupToken` parameter instead of a token so that we don't have to do that dance, but it's a bit outside of the scope of this patch...

================
Comment at: clang/test/Preprocessor/ucn-pp-identifier.c:141
+// expected-warning at -2 {{incomplete delimited universal character name}}

 #ifdef TRIGRAPHS
----------------
tahonermann wrote:
> How about adding a test for an escaped new line? I think this is currently UB according to the standard ([[ http://eel.is/c++draft/lex.phases#1.2 | (lex.phases)p2 ]]), but I believe you are trying to change that with [[ https://wg21.link/p2621 | P2621 ]].
>   int \N{\
>   LATIN CAPITAL LETTER A WITH GRAVE};
I can add a test, sure
P2621 is not super relevant (or rather, it standardize behaviors that clang already has)

Repository:
  rG LLVM Github Monorepo

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D138861/new/

https://reviews.llvm.org/D138861