[cfe-commits] [patch] Unicode character literals for UTF-8 source encoding

Mon Jan 9 18:47:28 PST 2012

On Jan 9, 2012, at 7:47 PM, Eli Friedman wrote:

> On Sun, Jan 8, 2012 at 1:00 AM, Seth Cantrell <seth.cantrell at gmail.com> wrote:
>> Here's a patch that improves support for Unicode character literals.
>> 
>> * adds errors for multiple characters in Unicode character literals and for char16_t character literals where the value can't be represented in 16 bits (2.14.3 p2)
>> 
>> * allows Unicode escapes in character and string literals to represent control characters and basic source characters (2.3 p2)
>> 
>> * treats valid UTF-8 encoded code points as single c-chars so that these are no longer counted as multi-chars. Narrow character literals will probably get warnings about the character value being too large for the type; Unicode and wide character literals will get the correct Unicode code point value if it can be represented.
>> 
>> * adds an error for invalid source encodings of character literals.
>> 
>> The patch builds without warnings in xcode and with llvm make, and, after applying my changes to the tests, make test in the clang directory passes.
> 
> +  // FIXME: unify the logic for determining the type of the char literal
> +  //  instead of repeating it here and in ActOnCharacterConstant
> +  int available_bits;
> +  if (!PP.getLangOptions().CPlusPlus)
> +    available_bits = PP.getTargetInfo().getIntWidth();
> +  else if (tok::wide_char_constant == Kind)
> +    available_bits = PP.getTargetInfo().getWCharWidth();
> +  else if (tok::utf16_char_constant == Kind)
> +    available_bits = PP.getTargetInfo().getChar16Width();
> +  else if (tok::utf32_char_constant == Kind)
> +    available_bits = PP.getTargetInfo().getChar32Width();
> +  else if (isMultiChar())
> +    available_bits = PP.getTargetInfo().getIntWidth();
> +  else
> +    available_bits = PP.getTargetInfo().getCharWidth();
> 
> Ugh... given layering, I don't see any good way around copy-pasting
> this... but it's worth mentioning that this logic is wrong for C.  Per
> the standard, "A wide character constant prefixed by the letter L has
> type wchar_t..."

Should I change it, possibly to:

  int available_bits;
  if (tok::wide_char_constant == Kind)
    available_bits = PP.getTargetInfo().getWCharWidth();
  else if (!PP.getLangOptions().CPlusPlus)
    available_bits = PP.getTargetInfo().getIntWidth();
  ...

?

How does getWCharWidth() even work when wchar_t is just a library typedef? I guess the target info must be made to reflect whatever the stddef.h being used says? Should I also update ActOnCharacterConstant?
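
To make the proposed ordering concrete, here is a minimal self-contained sketch of the whole chain with the wide-character check moved first. LiteralKind, TargetWidths and availableBits are hypothetical stand-ins for tok::TokenKind, TargetInfo and the code in the patch; the branches after the first two simply follow the order of the quoted patch and are an assumption about what the "..." covers.

```cpp
// Hypothetical stand-in for the token kinds used in the patch.
enum class LiteralKind { Narrow, Wide, Utf16, Utf32 };

// Hypothetical stand-in for the widths reported by TargetInfo.
struct TargetWidths {
  int CharWidth = 8;
  int IntWidth = 32;
  int WCharWidth = 32;
  int Char16Width = 16;
  int Char32Width = 32;
};

// Number of bits available to hold the value of a character literal.
int availableBits(LiteralKind Kind, bool CPlusPlus, bool IsMultiChar,
                  const TargetWidths &T) {
  if (Kind == LiteralKind::Wide)
    return T.WCharWidth;  // L'x' has type wchar_t in both C and C++
  if (!CPlusPlus)
    return T.IntWidth;    // other C character constants have type int
  if (Kind == LiteralKind::Utf16)
    return T.Char16Width;
  if (Kind == LiteralKind::Utf32)
    return T.Char32Width;
  if (IsMultiChar)
    return T.IntWidth;    // C++ multi-char literals have type int
  return T.CharWidth;     // ordinary C++ narrow literal has type char
}
```

With this ordering, a wide literal in C on a target with a 16-bit wchar_t gets 16 bits rather than the width of int.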

> 
> +  // Check UCN constraints (C99 6.4.3p2, C++03 2.2 p2)
> 
> Please use standard references of the form [lex.charset] for C++.
> 
> +  // C++ allows UCNs that refer to control characters and basic source
> +  // characters inside character and string literals
> +  if (!Features.CPlusPlus || !in_char_string_literal) {
> 
> UCNs referring to control characters are only allowed in C++11.
> 

Okay, I rewrote that bit.

What about the restriction to values no greater than 0x10FFFF? None of the standards mention that limit, but I guess it's implicit when they say that a UCN designates the character with that name in ISO/IEC 10646, and ISO/IEC 10646 doesn't have any characters above that value. By the same reasoning, though, we could disallow, e.g., the last two values in every plane (U+FFFE, U+FFFF, U+1FFFE, U+1FFFF, U+2FFFE, U+2FFFF, etc.), and maybe more besides. For comparison, GCC 4.5 does not limit UCNs to U+10FFFF and below, except where a larger value causes a problem later, such as when converting such illegal values to UTF-16.
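
For illustration only, the two cases mentioned here can be sketched as follows. Both function names are hypothetical and not part of the patch: the first recognizes the last two values of every plane, and the second shows why a value above U+10FFFF has no UTF-16 encoding (the surrogate-pair scheme only covers 20 bits above U+FFFF).

```cpp
#include <cstdint>
#include <vector>

// True for the last two code points of each plane:
// U+FFFE, U+FFFF, U+1FFFE, U+1FFFF, ..., U+10FFFE, U+10FFFF.
bool isPlaneEndNoncharacter(uint32_t CP) {
  return CP <= 0x10FFFF && (CP & 0xFFFE) == 0xFFFE;
}

// Encodes CP as UTF-16 code units. Returns an empty vector when CP cannot
// be encoded: values above U+10FFFF and lone surrogate code points.
std::vector<uint16_t> encodeUTF16(uint32_t CP) {
  if (CP > 0x10FFFF || (CP >= 0xD800 && CP <= 0xDFFF))
    return {};
  if (CP < 0x10000)
    return {static_cast<uint16_t>(CP)};
  uint32_t V = CP - 0x10000;  // at most 20 bits remain
  return {static_cast<uint16_t>(0xD800 + (V >> 10)),     // high surrogate
          static_cast<uint16_t>(0xDC00 + (V & 0x3FF))};  // low surrogate
}
```

Under this sketch U+10FFFF still encodes (as the pair DBFF DFFF), while 0x110000 does not, which is the kind of later failure described above.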

  // Check UCN constraints (C99 6.4.3p2) [C++03/11 lex.charset p2]
  bool invalid_ucn = 0x10FFFF < UcnVal; // maximum legal UTF32 value

  // C++03 does not restrict surrogate codepoints
  if (!Features.CPlusPlus || Features.CPlusPlus0x)
    invalid_ucn |= (0xD800<=UcnVal && UcnVal<=0xDFFF);

  // C++11 allows UCNs that refer to control characters and basic source
  // characters inside character and string literals
  if (!Features.CPlusPlus0x || !in_char_string_literal) {
    if ((UcnVal < 0xa0 &&
         (UcnVal != 0x24 && UcnVal != 0x40 && UcnVal != 0x60 ))) {  // $, @, `
      invalid_ucn = true;
    }
  }

  if (invalid_ucn) {
    if (Diags)
      Diags->Report(Loc, diag::err_ucn_escape_invalid);
    return false;
  }
  return true;
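
A standalone sketch of the same rules, as I read the comments in the snippet above, may be easier to test outside of clang. LangFeatures and isValidUCN are hypothetical names, not clang's API, and the sketch assumes the intent is: reject anything above 0x10FFFF, reject surrogates except in C++03, and reject control and basic source characters unless compiling C++11 inside a character or string literal.

```cpp
#include <cstdint>

// Hypothetical stand-in for the two language flags the check consults.
struct LangFeatures {
  bool CPlusPlus;
  bool CPlusPlus0x;
};

// Returns true if UcnVal is acceptable as a universal character name.
bool isValidUCN(uint32_t UcnVal, const LangFeatures &Features,
                bool InCharOrStringLiteral) {
  // Nothing above the last Unicode code point is accepted.
  if (UcnVal > 0x10FFFF)
    return false;

  // Surrogates are rejected except in C++03, which does not restrict them.
  bool IsCXX03 = Features.CPlusPlus && !Features.CPlusPlus0x;
  if (!IsCXX03 && UcnVal >= 0xD800 && UcnVal <= 0xDFFF)
    return false;

  // Values below 0xA0 other than $, @ and ` are only allowed in C++11,
  // and only inside character and string literals.
  bool IsControlOrBasic =
      UcnVal < 0xA0 && UcnVal != 0x24 && UcnVal != 0x40 && UcnVal != 0x60;
  if (IsControlOrBasic &&
      !(Features.CPlusPlus0x && InCharOrStringLiteral))
    return false;

  return true;
}
```

For example, \u0041 would be accepted in a C++11 string literal but rejected in a C++11 identifier and in any C++03 or C99 context.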

> Otherwise, this is looking good!
> 
> -Eli