[cfe-commits] [patch] Unicode character literals for UTF-8 source encoding

Mon Jan 9 23:34:25 PST 2012

On Mon, Jan 9, 2012 at 9:31 PM, Seth Cantrell <seth.cantrell at gmail.com> wrote:
> I don't see anything that indicates '\U0010FFFD' is perfectly valid. Does 'implementation defined' leave enough room for producing an error?

Yes.

> We'll already produce a warning in the case that this is assigned to a char.
>
> But it seems ambiguous as to what the value should be. C99 says:
>
>>       • If an integer character constant contains a single character or escape sequence, its value is the one that results when an object with type char whose value is that of the single character or escape sequence is converted to type int.

I think that's only supposed to apply to the case where there's a
single character that fits into a single char.
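
For the one-byte case, the quoted rule can be sketched roughly as
below. char_constant_value is a hypothetical helper made up for
illustration, not anything in clang, and the result for bytes above
CHAR_MAX depends on whether plain char is signed on the target.

```c
#include <limits.h>

/* Hypothetical helper: the value a one-byte character constant would
   get under the quoted C99 rule. The byte is stored in an object of
   type char, and that object is then converted to int. */
int char_constant_value(unsigned char byte)
{
    /* Implementation-defined when char is signed and byte > CHAR_MAX;
       common two's complement targets wrap to a negative value. */
    char c = (char)byte;
    return (int)c;
}
```

So with a signed 8-bit char, '\xff' would typically come out as -1
rather than 255, while 'A' is unchanged either way.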

> It seems more intuitive to just leave the integer's value alone. GCC's behavior is inscrutable to me though: '\U0010FFFD' == 0xf48fbfbd

It looks like gcc is converting to UTF-8 and stuffing the resulting
four bytes into the int.
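
A rough sketch of that guess, to show the arithmetic works out.
pack_utf8_4 is a hypothetical helper written for this mail, not gcc's
actual code, and it assumes a code point in the four-byte UTF-8 range
(U+10000 through U+10FFFF).

```c
#include <stdint.h>

/* Hypothetical helper: encode a code point in U+10000..U+10FFFF as
   four UTF-8 bytes and pack them into one 32-bit value, first byte in
   the most significant position. */
uint32_t pack_utf8_4(uint32_t cp)
{
    uint32_t b0 = 0xF0u | (cp >> 18);            /* lead byte: 11110xxx */
    uint32_t b1 = 0x80u | ((cp >> 12) & 0x3Fu);  /* continuation: 10xxxxxx */
    uint32_t b2 = 0x80u | ((cp >> 6) & 0x3Fu);   /* continuation: 10xxxxxx */
    uint32_t b3 = 0x80u | (cp & 0x3Fu);          /* continuation: 10xxxxxx */
    return (b0 << 24) | (b1 << 16) | (b2 << 8) | b3;
}
```

U+10FFFD encodes as F4 8F BF BD, so pack_utf8_4(0x10FFFD) gives
0xf48fbfbd, the same value reported for gcc above.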

-Eli