[cfe-commits] [patch] Unicode character literals for UTF-8 source encoding

Tue Jan 10 04:05:40 PST 2012

whoops, that should be "anything that indicates '\U0010FFFD' isn't perfectly valid"

Accepting larger Unicode escapes is not new with this patch. I tried the clang installed with Xcode 4.2, Apple clang version 3.0 (tags/Apple/clang-211.12) (based on LLVM 3.0svn), and `int i = '\U0001F306';` gives i the value 0x0001F306. Although I don't have a use case, my preference is to allow the larger Unicode escapes.

If you want them excluded, just let me know the ranges.

- Seth
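
For reference on the ranges in question, here is a minimal sketch of a check for whether a value is a Unicode scalar value, i.e. at most U+10FFFF and not a UTF-16 surrogate. The function name `is_unicode_scalar_value` is hypothetical and this is not the check clang performs; it only illustrates that U+10FFFD and U+1F306 both fall inside the range Unicode itself allows.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical helper, not part of clang: returns true when cp is a
   Unicode scalar value (0..0x10FFFF, excluding surrogates D800..DFFF). */
bool is_unicode_scalar_value(uint32_t cp)
{
    if (cp > 0x10FFFFu)
        return false;              /* beyond the last code point */
    if (cp >= 0xD800u && cp <= 0xDFFFu)
        return false;              /* UTF-16 surrogate range */
    return true;
}
```

Under this definition `is_unicode_scalar_value(0x10FFFD)` and `is_unicode_scalar_value(0x1F306)` both hold, while `0xD800` and `0x110000` are rejected. Whether a compiler should additionally reject other ranges is a separate policy question.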

On Jan 10, 2012, at 2:34 AM, Eli Friedman wrote:

> On Mon, Jan 9, 2012 at 9:31 PM, Seth Cantrell <seth.cantrell at gmail.com> wrote:
>> I don't see anything that indicates '\U0010FFFD' perfectly valid. Does 'implementation defined' leave enough room for producing an error?
> 
> Yes.
> 
>> We'll already produce a warning in the case that this is assigned to a char.
>> 
>> But it seems ambiguous as to what the value should be. C99 says:
>> 
>>>       • If an integer character constant contains a single character or escape sequence, its value is the one that results when an object with type char whose value is that of the single character or escape sequence is converted to type int.
> 
> I think that's only supposed to apply to the case where there's a
> single character that fits into a single char.
> 
>> It seems more intuitive to just leave the integer's value alone. GCC's behavior is inscrutable to me though: '\U0010FFFD' == 0xf48fbfbd
> 
> It looks like gcc is converting to UTF-8 and stuffing the resulting
> four bytes into the int.
> 
> -Eli
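
The UTF-8 explanation for the GCC value can be checked by hand. The sketch below encodes a code point as UTF-8 and packs the resulting bytes, first byte most significant, into a 32-bit integer. The function names `utf8_encode` and `pack_utf8_bytes` are hypothetical, and this is only a reconstruction of the observed result, not GCC's actual implementation.

```c
#include <stddef.h>
#include <stdint.h>

/* Encode a Unicode scalar value as UTF-8 into out[].
   Returns the number of bytes written (1-4), or 0 if cp is a
   surrogate or lies above U+10FFFF. */
size_t utf8_encode(uint32_t cp, unsigned char out[4])
{
    if (cp < 0x80u) {
        out[0] = (unsigned char)cp;
        return 1;
    }
    if (cp < 0x800u) {
        out[0] = (unsigned char)(0xC0u | (cp >> 6));
        out[1] = (unsigned char)(0x80u | (cp & 0x3Fu));
        return 2;
    }
    if (cp < 0x10000u) {
        if (cp >= 0xD800u && cp <= 0xDFFFu)
            return 0;              /* surrogates are not encodable */
        out[0] = (unsigned char)(0xE0u | (cp >> 12));
        out[1] = (unsigned char)(0x80u | ((cp >> 6) & 0x3Fu));
        out[2] = (unsigned char)(0x80u | (cp & 0x3Fu));
        return 3;
    }
    if (cp <= 0x10FFFFu) {
        out[0] = (unsigned char)(0xF0u | (cp >> 18));
        out[1] = (unsigned char)(0x80u | ((cp >> 12) & 0x3Fu));
        out[2] = (unsigned char)(0x80u | ((cp >> 6) & 0x3Fu));
        out[3] = (unsigned char)(0x80u | (cp & 0x3Fu));
        return 4;
    }
    return 0;
}

/* Pack the UTF-8 bytes of cp into one integer, first byte in the
   most significant position. Returns 0 for unencodable input. */
uint32_t pack_utf8_bytes(uint32_t cp)
{
    unsigned char bytes[4];
    size_t n = utf8_encode(cp, bytes);
    uint32_t value = 0;
    for (size_t i = 0; i < n; ++i)
        value = (value << 8) | bytes[i];
    return value;
}
```

U+10FFFD encodes as the bytes F4 8F BF BD, so `pack_utf8_bytes(0x10FFFD)` yields 0xf48fbfbd, matching the value reported for GCC above.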