[PATCH] D119221: [clang][lexer] Allow u8 character literal prefixes in C2x

Tue Apr 5 14:12:44 PDT 2022

aaron.ballman added inline comments.

================
Comment at: clang/lib/Lex/Lexer.cpp:3462

-  case 'u':   // Identifier (uber) or C11/C++11 UTF-8 or UTF-16 string literal
+  // Identifer (e.g., uber), or
+  // UTF-8 (C2x/C++17) or UTF-16 (C11/C++11) character literal, or
----------------

================
Comment at: clang/test/Lexer/utf8-char-literal.cpp:3
 // RUN: %clang_cc1 -triple x86_64-apple-darwin -std=c11 -x c -fsyntax-only -verify %s
+// RUN: %clang_cc1 -triple x86_64-apple-darwin -std=c2x -DC2X -x c -fsyntax-only -verify %s
 // RUN: %clang_cc1 -triple x86_64-apple-darwin -std=c++1z -fsyntax-only -verify %s
----------------
no need for the new `-D`

================
Comment at: clang/test/Lexer/utf8-char-literal.cpp:16
 char f = u8'ab'; // expected-error {{Unicode character literals may not contain multiple characters}}
+#elif defined(C2X)
+char a = u8'ñ';      // expected-error {{character too large for enclosing character literal type}}
----------------
You can test with `__STDC_VERSION__ > 202000L`.

================
Comment at: clang/test/Lexer/utf8-char-literal.cpp:23
+char f = u8'ab';            // expected-error {{Unicode character literals may not contain multiple characters}}
+char g = u8'\x80';          // expected-warning {{implicit conversion from 'int' to 'char' changes value from 128 to -128}}
 #endif
----------------
tahonermann wrote:
> aaron.ballman wrote:
> > One more test I'd like to see added, just to make sure we're covering 6.4.4.4p9 properly:
> > ```
> > _Static_assert(
> >   _Generic(u8'a',
> >            default: 0,
> >            unsigned char : 1),
> >   "Surprise!");  
> > ```
> > We expect the type of a u8 character literal to be `unsigned char` at the moment, which is different from a u8 string literal, which uses `char`.
> > 
> > However, WG14 is also going to be considering http://www.open-std.org/jtc1/sc22/wg14/www/docs/n2653.htm for C2x at our meeting next week.
> Good suggestion. I believe the following update will be needed to`Sema::ActOnCharacterConstant()` in `clang/lib/Sema/SemaExpr.cpp`:
>   ...
>   else if (Literal.isUTF8() && getLangOpts().C2x)
>     Ty = Context.UnsignedCharTy; // u8'x' -> unsigned char in c2x.
>   else if Literal.isUTF8() && getLangOpts().Char8)
>     Ty = Context.Char8Ty; // u8'x' -> char8_t when it exists.
>   ...
> 
> However, WG14 is also going to be considering http://www.open-std.org/jtc1/sc22/wg14/www/docs/n2653.htm for C2x at our meeting next week.

I have an update on this. We discussed the paper and took a straw poll:
```
Does WG14 wish to adopt N2653 in C23? 18/0/2 (consensus)
```
So we should make sure that we all agree this patch is in line with the changes from that paper. I believe your changes agree, but it'd be nice for @tahonermann to confirm.

================
Comment at: clang/test/Lexer/utf8-char-literal.cpp:24
+char g = u8'\x80';          // expected-warning {{implicit conversion from 'int' to 'char' changes value from 128 to -128}}
 #endif
----------------
tbaeder wrote:
> tahonermann wrote:
> > We should also exercise the preprocessor with something like this:
> >   #if u8'\xff' != 0xff
> >   #error uh oh
> >   #endif
> > 
> > Hmm, this currently fails for C++20 for both Clang and gcc unless `-funsigned-char` is passed. That seems wrong. https://godbolt.org/z/Tb7z85ToG. MSVC gets this wrong too, but I think for a different reason; see the implementation impact section of [[ https://wg21.link/p2029 | P2029 ]] if curious.
> This also fails in C2x.
I don't think we need to fix the preprocessor behavior as part of this patch, but it would be good to file an issue for this so we know to track it down at some point.

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D119221/new/

https://reviews.llvm.org/D119221