[PATCH] D123064: [Clang][C++23] P2071 Named universal character escapes

Fri Jun 24 15:55:49 PDT 2022

tahonermann accepted this revision as: tahonermann.
tahonermann added a comment.
This revision is now accepted and ready to land.

> @tahonermann gentle ping (Aaron told me you might have further comments)

I'm sorry for the delay. I ran out of time to do the thorough review I would have liked to do, but I did scroll through everything now and did not find anything concerning; Aaron clearly conducted a thorough review already. It looks great to me, really nice work!

================
Comment at: clang/lib/Lex/Lexer.cpp:3255-3260
+    if (!isAlphanumeric(C) && C != '_' && C != '-' && C != ' ')
+      break;
+
+    if ((C < 'A' || C > 'Z') && !llvm::isDigit(C) && C != ' ' && C != '-') {
+      Invalid = true;
+    }
----------------
cor3ntin wrote:
> tahonermann wrote:
> > It isn't clear to me why there are two separate `if` statements here. I would expect the following to suffice. If I'm misunderstanding something, then comments might be helpful.
> >   if (!isUppercase(C) && !isDigit(C) && C != '-' && C != ' ') {
> >     Invalid = true;
> >     break;
> >   }
> > 
> > I'm not sure why there is a test for '_' since that character is not present in the grammar for `n-char` in P2071.
> > 
> > Is it intentional that there is no `break` statement in the second `if` statement?
> I improved that, what it does should be more clear now. More importantly, I added a diagnostic note when we detect a loose match.
> 
> We allow `_` because  Unicode does.
> We first perform a strict match - which fails as no Unicode name contains an underscore, we emit a diagnostic, and then we try a loose matching which does allow `_`.
> This enable us to produces better diagnostics
> 
> ```
> <source>:2:18: error: 'GREEK_SMALL_LETTER-OMICRON' is not a valid Unicode character name
>     const char* \N{GREEK_SMALL_LETTER-OMICRON} = "\N{zero width no break space}"; 
>                  ^
> <source>:2:20: note: characters names in Unicode escape sequences are sensitive to case and whitespaces
>     const char* \N{GREEK_SMALL_LETTER-OMICRON} = "\N{zero width no break space}"; 
>                    ^~~~~~~~~~~~~~~~~~~~~~~~~~
>                    GREEK SMALL LETTER OMICRON
> <source>:2:54: error: 'zero width no break space' is not a valid Unicode character name
>     const char* \N{GREEK_SMALL_LETTER-OMICRON} = "\N{zero width no break space}"; 
>                                                      ^~~~~~~~~~~~~~~~~~~~~~~~~
> <source>:2:54: note: characters names in Unicode escape sequences are sensitive to case and whitespaces
>     const char* \N{GREEK_SMALL_LETTER-OMICRON} = "\N{zero width no break space}"; 
>                                                      ^~~~~~~~~~~~~~~~~~~~~~~~~
>                                                      ZERO WIDTH NO-BREAK SPACE```
> 
That's great! Very nice!

Repository:
  rG LLVM Github Monorepo

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D123064/new/

https://reviews.llvm.org/D123064