[PATCH] D119221: [clang][lexer] Allow u8 character literal prefixes in C2x

Fri Feb 11 19:11:31 PST 2022

tahonermann added inline comments.

================
Comment at: clang/lib/Lex/Lexer.cpp:3462

-  case 'u':   // Identifier (uber) or C11/C++11 UTF-8 or UTF-16 string literal
+  case 'u': // Identifier (uber) or C11/C2x/C++11 UTF-8 or UTF-16 string literal
     // Notify MIOpt that we read a non-whitespace/non-comment token.
----------------
The comment is slightly misleading both before and after this change. Assuming this level of detail is desired, I suggest:
  // Identifier (e.g., uber), or
  // UTF-8 (C2x/C++17) or UTF-16 (C11/C++11) character literal, or
  // UTF-8 or UTF-16 string literal (C11/C++11).
  case 'u':
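
For reference, a minimal C++ sketch of the token kinds that `case 'u':` has to tell apart. It assumes a C++17 or later compiler (needed for the u8 character literal). `uber` is just the example identifier from the comment above, and size checks are used for the u8 forms because their types differ between language modes.

```cpp
#include <type_traits>

// An ordinary identifier that merely starts with 'u'.
constexpr int uber = 1;

// UTF-16 character literal (C11/C++11); its type is char16_t in C++.
static_assert(std::is_same<decltype(u'a'), char16_t>::value,
              "u'a' should be char16_t");

// UTF-16 string literal (C11/C++11): two code units plus the terminator.
static_assert(std::is_same<decltype(u"ab"), const char16_t (&)[3]>::value,
              "u\"ab\" should be an array of three char16_t");

// UTF-8 character literal (C2x/C++17): exactly one code unit.
static_assert(sizeof(u8'a') == 1, "u8'a' should be one byte wide");

// UTF-8 string literal (C11/C++11): two code units plus the terminator.
static_assert(sizeof(u8"ab") == 3, "u8\"ab\" should occupy three bytes");
```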

================
Comment at: clang/test/Lexer/utf8-char-literal.cpp:23
+char f = u8'ab';            // expected-error {{Unicode character literals may not contain multiple characters}}
+char g = u8'\x80';          // expected-warning {{implicit conversion from 'int' to 'char' changes value from 128 to -128}}
 #endif
----------------
aaron.ballman wrote:
> One more test I'd like to see added, just to make sure we're covering 6.4.4.4p9 properly:
> ```
> _Static_assert(
>   _Generic(u8'a',
>            default: 0,
>            unsigned char : 1),
>   "Surprise!");  
> ```
> We expect the type of a u8 character literal to be `unsigned char` at the moment, which is different from a u8 string literal, which uses `char`.
> 
> However, WG14 is also going to be considering http://www.open-std.org/jtc1/sc22/wg14/www/docs/n2653.htm for C2x at our meeting next week.
Good suggestion. I believe the following update will be needed to `Sema::ActOnCharacterConstant()` in `clang/lib/Sema/SemaExpr.cpp`:
  ...
  else if (Literal.isUTF8() && getLangOpts().C2x)
    Ty = Context.UnsignedCharTy; // u8'x' -> unsigned char in C2x.
  else if (Literal.isUTF8() && getLangOpts().Char8)
    Ty = Context.Char8Ty; // u8'x' -> char8_t when it exists.
  ...
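
A standalone sketch of the intended precedence, not the actual Sema code: `LangFlags`, `Utf8CharType`, and `utf8CharLiteralType` are hypothetical stand-ins for `getLangOpts()` and the `Context.*Ty` types, and `Other` stands for whatever the remaining branches of the real function select.

```cpp
// Hypothetical model of the two branches suggested above.
enum class Utf8CharType { UnsignedChar, Char8, Other };

// Hypothetical subset of the language options consulted by the branches.
struct LangFlags {
  bool C2x;   // compiling as C2x
  bool Char8; // char8_t is available (C++20 or -fchar8_t)
};

// The C2x check comes first, then the char8_t check, then the fallback.
constexpr Utf8CharType utf8CharLiteralType(LangFlags LO) {
  return LO.C2x     ? Utf8CharType::UnsignedChar
         : LO.Char8 ? Utf8CharType::Char8
                    : Utf8CharType::Other;
}

static_assert(utf8CharLiteralType(LangFlags{true, false}) ==
                  Utf8CharType::UnsignedChar,
              "C2x should select unsigned char");
```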

================
Comment at: clang/test/Lexer/utf8-char-literal.cpp:24
+char g = u8'\x80';          // expected-warning {{implicit conversion from 'int' to 'char' changes value from 128 to -128}}
 #endif
----------------
We should also exercise the preprocessor with something like this:
  #if u8'\xff' != 0xff
  #error uh oh
  #endif

Hmm, this currently fails in C++20 mode for both Clang and GCC unless `-funsigned-char` is passed. That seems wrong (https://godbolt.org/z/Tb7z85ToG). MSVC gets this wrong too, but I think for a different reason; see the implementation impact section of [[ https://wg21.link/p2029 | P2029 ]] if curious.
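
To compare the two views of the literal without tripping `#error`, something like the following could be used. It is only a sketch: it assumes a C++17 or later compiler, and the names `kPreprocessorSeesUnsigned` and `kLiteralAsInt` are made up for illustration. Which `#if` branch is taken depends on the compiler and flags, as noted above, so no particular result is claimed for it.

```cpp
// Record which branch the preprocessor takes instead of issuing #error.
#if u8'\xff' == 0xff
constexpr bool kPreprocessorSeesUnsigned = true;
#else
constexpr bool kPreprocessorSeesUnsigned = false;
#endif

// The same literal in an ordinary constant expression, widened to int.
// This is 255 when the literal's type is unsigned (char8_t, or char with
// -funsigned-char) and -1 when it is a signed char.
constexpr int kLiteralAsInt = u8'\xff';

// Once converted to unsigned char the value is 0xff in either case.
static_assert(static_cast<unsigned char>(u8'\xff') == 0xff,
              "the code unit value should be 0xff");
```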

Repository:
  rG LLVM Github Monorepo

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D119221/new/

https://reviews.llvm.org/D119221