[cfe-commits] [patch] Unicode character literals for UTF-8 source encoding

Mon Jan 9 20:05:35 PST 2012

Updated patches. There's an extra one for the change to ActOnCharacterConstant.

Attachments (non-text, scrubbed by the list archive):

0001-Adds-support-for-Unicode-character-literals.patch (11411 bytes)
  <http://lists.llvm.org/pipermail/cfe-commits/attachments/20120109/95b94a3e/attachment.obj>
0002-Fix-char-literal-types-in-C.patch (1501 bytes)
  <http://lists.llvm.org/pipermail/cfe-commits/attachments/20120109/95b94a3e/attachment-0001.obj>
0003-add-tests-for-unicode-character-literals.patch (1629 bytes)
  <http://lists.llvm.org/pipermail/cfe-commits/attachments/20120109/95b94a3e/attachment-0002.obj>
0004-stop-claiming-unicode-escape-sequences-are-too-long-.patch (862 bytes)
  <http://lists.llvm.org/pipermail/cfe-commits/attachments/20120109/95b94a3e/attachment-0003.obj>
0005-update-existing-tests-for-unicode-character-literal-.patch (3625 bytes)
  <http://lists.llvm.org/pipermail/cfe-commits/attachments/20120109/95b94a3e/attachment-0004.obj>

On Jan 9, 2012, at 10:12 PM, Eli Friedman wrote:

> On Mon, Jan 9, 2012 at 6:47 PM, Seth Cantrell <seth.cantrell at gmail.com> wrote:
>> 
>> On Jan 9, 2012, at 7:47 PM, Eli Friedman wrote:
>> 
>>> On Sun, Jan 8, 2012 at 1:00 AM, Seth Cantrell <seth.cantrell at gmail.com> wrote:
>>>> Here's a patch that improves support for Unicode character literals.
>>>> 
>>>> * adds errors for multiple characters in Unicode character literals and for char16_t character literals where the value can't be represented in 16 bits (2.14.3 p2)
>>>> 
>>>> * allows Unicode escapes in character and string literals to represent control characters and basic source characters (2.3 p2)
>>>> 
>>>> * treats valid UTF-8 encoded code points as single c-chars so that they are no longer counted as multi-chars. Narrow character literals will probably get warnings about the character value being too large for the type; Unicode and wide character literals will get the correct Unicode code point value if it can be represented.
>>>> 
>>>> * adds an error for invalid source encodings of character literals.
>>>> 
>>>> The patch builds without warnings in xcode and with llvm make, and, after applying my changes to the tests, make test in the clang directory passes.
>>> 
>>> +  // FIXME: unify the logic for determining the type of the char literal
>>> +  //  instead of repeating it here and in ActOnCharacterConstant
>>> +  int available_bits;
>>> +  if (!PP.getLangOptions().CPlusPlus)
>>> +    available_bits = PP.getTargetInfo().getIntWidth();
>>> +  else if (tok::wide_char_constant == Kind)
>>> +    available_bits = PP.getTargetInfo().getWCharWidth();
>>> +  else if (tok::utf16_char_constant == Kind)
>>> +    available_bits = PP.getTargetInfo().getChar16Width();
>>> +  else if (tok::utf32_char_constant == Kind)
>>> +    available_bits = PP.getTargetInfo().getChar32Width();
>>> +  else if (isMultiChar())
>>> +    available_bits = PP.getTargetInfo().getIntWidth();
>>> +  else
>>> +    available_bits = PP.getTargetInfo().getCharWidth();
>>> 
>>> Ugh... given layering, I don't see any good way around copy-pasting
>>> this... but it's worth mentioning that this logic is wrong for C.  Per
>>> the standard, "A wide character constant prefixed by the letter L has
>>> type wchar_t..."
>> 
>> Should I change it, possibly to:
>> 
>>  int available_bits;
>>  if (tok::wide_char_constant == Kind)
>>    available_bits = PP.getTargetInfo().getWCharWidth();
>>  else if (!PP.getLangOptions().CPlusPlus)
>>    available_bits = PP.getTargetInfo().getIntWidth();
>>  ...
> 
> C11 says "a wide character constant prefixed by the letter u or U has
> type char16_t or char32_t, respectively", so you need to push it down
> a bit more.  But yes, that's the right idea.
> 
>> ?
>> 
>> How does getWCharWidth() even work when wchar_t is just a library typedef? I guess the target info must be made to reflect whatever the stddef.h being used says?
> 
> Yes, exactly.  Otherwise you couldn't pass an L"" to an API expecting
> a wchar_t*.
> 
>> Should I also update ActOnCharacterConstant?
> 
> Yes, please.
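
For illustration, the ordering suggested above (prefixed literals first, the
C-versus-C++ check pushed below them) could be sketched as follows. This is
a standalone approximation, not the code from the patch: CharKind,
TargetWidths, and availableBits are hypothetical stand-ins for the tok::
kinds and the TargetInfo queries, and the default widths are assumptions
about a typical target.

```cpp
// Hypothetical stand-in for tok::char_constant, tok::wide_char_constant, etc.
enum class CharKind { Plain, Wide, UTF16, UTF32 };

// Hypothetical stand-in for the TargetInfo width queries; values are assumed.
struct TargetWidths {
  unsigned CharWidth = 8;
  unsigned IntWidth = 32;
  unsigned WCharWidth = 32;
  unsigned Char16Width = 16;
  unsigned Char32Width = 32;
};

// Number of bits available for the value of a character literal.
unsigned availableBits(CharKind Kind, bool CPlusPlus, bool IsMultiChar,
                       const TargetWidths &T) {
  // Prefixed literals have the same type in C and C++, so check them first.
  if (Kind == CharKind::Wide)
    return T.WCharWidth;
  if (Kind == CharKind::UTF16)
    return T.Char16Width;
  if (Kind == CharKind::UTF32)
    return T.Char32Width;
  // Unprefixed literal: int in C, and in C++ when it is a multi-char literal.
  if (!CPlusPlus || IsMultiChar)
    return T.IntWidth;
  // Ordinary single-character literal in C++ has type char.
  return T.CharWidth;
}
```

With this ordering, L'x' in C is sized by the wchar_t width rather than by
the width of int, which is the case the original check got wrong.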
> 
>>> 
>>> +  // Check UCN constraints (C99 6.4.3p2, C++03 2.2 p2)
>>> 
>>> Please use standard references of the form [lex.charset] for C++.
>>> 
>>> +  // C++ allows UCNs that refer to control characters and basic source
>>> +  // characters inside character and string literals
>>> +  if (!Features.CPlusPlus || !in_char_string_literal) {
>>> 
>>> UCNs referring to control characters are only allowed in C++11.
>>> 
>> 
>> Okay, I rewrote that bit.
>> 
>> What about the restriction to values no greater than 0x10FFFF? None of the standards mention that limit, but I guess it's implicit when they say that UCNs designate the character with that name in ISO/IEC 10646, and ISO/IEC 10646 doesn't have any characters above that value. But by the same reasoning we could disallow, e.g., the last two values in every plane (U+FFFE, U+FFFF, U+1FFFE, U+1FFFF, U+2FFFE, U+2FFFF, etc.) and maybe more besides. For comparison, GCC 4.5 does not limit UCNs to U+10FFFF and below, except in circumstances where it causes a problem later, such as trying to convert such illegal values to UTF-16.
> 
> We could; I would say it's probably not worth bothering.
> 
> -Eli
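
As a rough summary of the UCN constraints discussed in this thread, a check
along the following lines would cover them. It is only a sketch under stated
assumptions: isValidUCN and its parameters are hypothetical names, the
0x10FFFF cap is the patch's choice rather than explicit standard wording,
surrogates are rejected in every mode for simplicity, and noncharacters such
as U+FFFE are accepted, in line with the "not worth bothering" conclusion.

```cpp
#include <cstdint>

// Returns true if a universal-character-name with this value is accepted.
// CXX11 selects the C++11 rules; otherwise the C99/C++03-style rule applies.
bool isValidUCN(std::uint32_t CodePoint, bool CXX11,
                bool InCharOrStringLiteral) {
  // Assumed cap: ISO/IEC 10646 defines no characters above U+10FFFF.
  if (CodePoint > 0x10FFFF)
    return false;
  // Surrogate code points are never valid (simplification: all modes).
  if (CodePoint >= 0xD800 && CodePoint <= 0xDFFF)
    return false;
  // C++11 lifts the remaining restrictions inside character/string literals.
  if (CXX11 && InCharOrStringLiteral)
    return true;
  // Otherwise control characters and basic source characters are rejected:
  // everything below U+00A0 except '$' (0x24), '@' (0x40), and '`' (0x60).
  if (CodePoint < 0xA0)
    return CodePoint == 0x24 || CodePoint == 0x40 || CodePoint == 0x60;
  return true;
}
```

For example, \u0041 would be accepted inside a C++11 string literal but
rejected in an identifier or in C99 mode, while \uFFFE passes in all modes.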