[cfe-commits] [patch] Unicode character literals for UTF-8 source encoding

Eli Friedman eli.friedman at gmail.com
Mon Jan 9 16:47:06 PST 2012


On Sun, Jan 8, 2012 at 1:00 AM, Seth Cantrell <seth.cantrell at gmail.com> wrote:
> Here's a patch that improves support for Unicode character literals.
>
> * adds errors for multiple characters in Unicode character literals and for char16_t character literals where the value can't be represented in 16 bits (2.14.3 p2)
>
> * allows unicode escapes in character and string literals to represent control characters and basic source characters (2.3 p2)
>
> * treats valid UTF-8 encoded code points as single c-chars so that these are no longer counted as multi-chars. Narrow character literals will probably get warnings about the character value being too large for the type; Unicode and wide character literals will get the correct Unicode code point value if it can be represented.
>
> * adds an error for invalid source encodings in character literals.
>
> The patch builds without warnings in xcode and with llvm make, and, after applying my changes to the tests, make test in the clang directory passes.

+  // FIXME: unify the logic for determining the type of the char literal
+  //  instead of repeating it here and in ActOnCharacterConstant
+  int available_bits;
+  if (!PP.getLangOptions().CPlusPlus)
+    available_bits = PP.getTargetInfo().getIntWidth();
+  else if (tok::wide_char_constant == Kind)
+    available_bits = PP.getTargetInfo().getWCharWidth();
+  else if (tok::utf16_char_constant == Kind)
+    available_bits = PP.getTargetInfo().getChar16Width();
+  else if (tok::utf32_char_constant == Kind)
+    available_bits = PP.getTargetInfo().getChar32Width();
+  else if (isMultiChar())
+    available_bits = PP.getTargetInfo().getIntWidth();
+  else
+    available_bits = PP.getTargetInfo().getCharWidth();

Ugh... given layering, I don't see any good way around copy-pasting
this... but it's worth mentioning that this logic is wrong for C: the
first branch gives every C character literal the width of int, wide
literals included.  Per the standard, "A wide character constant
prefixed by the letter L has type wchar_t..."
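
To illustrate the ordering issue, here is a minimal standalone sketch,
not the actual patch. CharKind, TargetWidths and availableBits are
hypothetical stand-ins for the token kinds and the TargetInfo queries;
the point is only that the prefix is checked before the C/C++ split.

```cpp
#include <cassert>

// Hypothetical stand-in for the tok::*_char_constant kinds in the patch.
enum class CharKind { Narrow, Wide, UTF16, UTF32 };

// Hypothetical stand-in for the TargetInfo width queries; the default
// values are only example numbers for one possible target.
struct TargetWidths {
  unsigned CharWidth = 8;
  unsigned IntWidth = 32;
  unsigned WCharWidth = 32;
  unsigned Char16Width = 16;
  unsigned Char32Width = 32;
};

// Number of bits available for the value of a character literal.
unsigned availableBits(CharKind Kind, bool CPlusPlus, bool IsMultiChar,
                       const TargetWidths &T) {
  // Prefixed literals are handled first, so L'x' gets the width of
  // wchar_t in C as well as in C++.
  if (Kind == CharKind::Wide)
    return T.WCharWidth;
  if (Kind == CharKind::UTF16)
    return T.Char16Width;
  if (Kind == CharKind::UTF32)
    return T.Char32Width;
  // Unprefixed literals: int in C, and int for multi-char literals in C++.
  if (!CPlusPlus || IsMultiChar)
    return T.IntWidth;
  // Ordinary single-character literal in C++.
  return T.CharWidth;
}
```

With a 16-bit wchar_t target, availableBits(CharKind::Wide, false,
false, T) then yields 16 rather than the width of int.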

+  // Check UCN constraints (C99 6.4.3p2, C++03 2.2 p2)

Please use standard references of the form [lex.charset] for C++.

+  // C++ allows UCNs that refer to control characters and basic source
+  // characters inside character and string literals
+  if (!Features.CPlusPlus || !in_char_string_literal) {

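UCNs referring to control characters (or to basic source characters)
are only allowed in C++11, and only inside character and string
literals; C++03 rejects them everywhere.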
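
Roughly, the check could look like the sketch below. This is a
standalone illustration under my reading of C99 6.4.3p2 and
[lex.charset], not code from the patch: LangMode, its fields and
isUCNAllowed are hypothetical names, and constraints other than the
ones discussed here (such as the upper bound on code points) are left
out.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical language-mode flags; the real LangOptions fields may be
// named differently.
struct LangMode {
  bool CPlusPlus;
  bool CPlusPlus11;
};

// Control characters: 0x00-0x1F and 0x7F-0x9F.
bool isControlChar(uint32_t CP) {
  return CP < 0x20 || (CP >= 0x7F && CP <= 0x9F);
}

// Printable members of the basic source character set: printable ASCII
// except '$' (0x24), '@' (0x40) and '`' (0x60). The remaining members
// (tab, newline, ...) are already covered by isControlChar.
bool isBasicSourceChar(uint32_t CP) {
  return CP >= 0x20 && CP <= 0x7E && CP != 0x24 && CP != 0x40 && CP != 0x60;
}

// Returns true if a UCN naming code point CP is acceptable.
bool isUCNAllowed(uint32_t CP, LangMode L, bool InCharOrStringLiteral) {
  // Surrogate code points are never valid UCNs.
  if (CP >= 0xD800 && CP <= 0xDFFF)
    return false;

  if (!L.CPlusPlus) {
    // C99: nothing below 0xA0 except '$', '@' and '`'.
    return CP >= 0xA0 || CP == 0x24 || CP == 0x40 || CP == 0x60;
  }

  // C++: control and basic source characters are restricted.
  if (!isControlChar(CP) && !isBasicSourceChar(CP))
    return true;

  // Only C++11 lifts the restriction, and only inside literals.
  return L.CPlusPlus11 && InCharOrStringLiteral;
}
```

So '\u0041' would be rejected in C++03 mode and accepted in C++11 mode,
while the same UCN in an identifier is rejected in both.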

Otherwise, this is looking good!

-Eli
