[cfe-commits] [patch] Unicode character literals for UTF-8 source encoding

Seth Cantrell seth.cantrell at gmail.com
Mon Jan 9 21:31:33 PST 2012


I don't see anything that indicates '\U0010FFFD' is perfectly valid. Does 'implementation-defined' leave enough room for producing an error? We already produce a warning when this is assigned to a char.

But it seems ambiguous what the value should be. C99 (6.4.4.4p10) says:

> If an integer character constant contains a single character or escape sequence, its value is the one that results when an object with type char whose value is that of the single character or escape sequence is converted to type int.


The value of the escape sequence is 0x10FFFD, and a char can't hold that value, so it makes no sense to talk about a char with that value being converted to an int. It could be taken to mean:

(int)(char)0x10FFFD;
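
For illustration, here is a rough model of what that cast chain yields. It assumes an 8-bit char and the two's-complement wraparound that common compilers use; in C99 the out-of-range conversion to a signed char is itself implementation-defined, so this is a sketch and not a statement of what any particular compiler must do. The name narrow_then_widen is made up for this example.

```cpp
#include <cstdint>

// Hypothetical helper: models (int)(char)value for an 8-bit char.
// Assumption: out-of-range values wrap modulo 256 (two's complement).
constexpr int narrow_then_widen(std::uint32_t value, bool char_is_signed) {
    // Keep only the low byte, as an 8-bit char would.
    const int low_byte = static_cast<int>(value & 0xFFu);
    // A signed char sign-extends when it is widened back to int.
    if (char_is_signed && low_byte >= 0x80) {
        return low_byte - 0x100;
    }
    return low_byte;
}
```

Under these assumptions 0x10FFFD becomes -3 where char is signed and 0xFD (253) where it is unsigned. Either way the code point is lost.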

It seems more intuitive to just leave the integer's value alone. GCC's behavior is inscrutable to me, though: '\U0010FFFD' == 0xf48fbfbd
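
One reading that reproduces the number above: the bytes F4 8F BF BD are the UTF-8 encoding of U+10FFFD, so the value looks like those four bytes packed in the way a multi-character constant such as 'abcd' is commonly evaluated. The sketch below only checks that arithmetic; it is not taken from GCC's source, and encode_utf8 and pack_multichar are hypothetical names.

```cpp
#include <cstdint>
#include <vector>

// Encode one code point as UTF-8.
// Assumption: cp is a valid scalar value (<= 0x10FFFF, not a surrogate).
std::vector<std::uint8_t> encode_utf8(std::uint32_t cp) {
    std::vector<std::uint8_t> out;
    auto push = [&out](std::uint32_t byte) {
        out.push_back(static_cast<std::uint8_t>(byte));
    };
    if (cp < 0x80) {
        push(cp);
    } else if (cp < 0x800) {
        push(0xC0 | (cp >> 6));
        push(0x80 | (cp & 0x3F));
    } else if (cp < 0x10000) {
        push(0xE0 | (cp >> 12));
        push(0x80 | ((cp >> 6) & 0x3F));
        push(0x80 | (cp & 0x3F));
    } else {
        push(0xF0 | (cp >> 18));
        push(0x80 | ((cp >> 12) & 0x3F));
        push(0x80 | ((cp >> 6) & 0x3F));
        push(0x80 | (cp & 0x3F));
    }
    return out;
}

// Pack bytes like a multi-character constant: shift the running value
// left by 8 bits and OR in the next byte (assumes at most four bytes).
std::uint32_t pack_multichar(const std::vector<std::uint8_t>& bytes) {
    std::uint32_t value = 0;
    for (std::uint8_t b : bytes) {
        value = (value << 8) | b;
    }
    return value;
}
```

With these definitions, pack_multichar(encode_utf8(0x10FFFD)) evaluates to 0xF48FBFBD, matching the value GCC reports.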

- Seth

On Jan 9, 2012, at 11:56 PM, Eli Friedman wrote:

> On Mon, Jan 9, 2012 at 8:05 PM, Seth Cantrell <seth.cantrell at gmail.com> wrote:
>> Updated patches. There's an extra one for the change to ActOnCharacterConstant.
>> 
> 
> +  // FIXME: unify the logic for determining the type of the char literal
> +  //  instead of repeating it here and in ActOnCharacterConstant
> +  int available_bits;
> +  if (tok::wide_char_constant == Kind)
> +    available_bits = PP.getTargetInfo().getWCharWidth();
> +  else if (tok::utf16_char_constant == Kind)
> +    available_bits = PP.getTargetInfo().getChar16Width();
> +  else if (tok::utf32_char_constant == Kind)
> +    available_bits = PP.getTargetInfo().getChar32Width();
> +  else if (!PP.getLangOptions().CPlusPlus || isMultiChar())
> +    available_bits = PP.getTargetInfo().getIntWidth();
> +  else
> +    available_bits = PP.getTargetInfo().getCharWidth();
> 
> Actually, thinking about it a bit more, I'm still not sure this is
> actually what we want to do; do we really want to allow '\U0010FFFD'
> in C?  I mean, strictly speaking, it's implementation-defined, but I
> don't think there's any precedent for the value we use with this
> patch.
> 
> -Eli
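
To make the consequence of the quoted hunk concrete, here is a standalone sketch of the same width selection, followed by a check of whether a value fits. CharKind, TargetWidths, available_bits and fits are hypothetical stand-ins for clang's token kinds and TargetInfo queries, and the default widths are assumptions about a typical target, not values read from clang.

```cpp
#include <cstdint>

// Hypothetical stand-in for the tok::*_char_constant kinds.
enum class CharKind { Ordinary, Wide, Utf16, Utf32 };

// Hypothetical stand-in for the TargetInfo width queries (assumed values).
struct TargetWidths {
    unsigned char_bits = 8;
    unsigned int_bits = 32;
    unsigned wchar_bits = 32;
    unsigned char16_bits = 16;
    unsigned char32_bits = 32;
};

// Mirrors the if/else chain in the quoted patch.
unsigned available_bits(CharKind kind, bool cplusplus, bool multi_char,
                        const TargetWidths& t) {
    switch (kind) {
    case CharKind::Wide:     return t.wchar_bits;
    case CharKind::Utf16:    return t.char16_bits;
    case CharKind::Utf32:    return t.char32_bits;
    case CharKind::Ordinary: break;
    }
    // Ordinary literals get int width in C, and in C++ when multi-character.
    return (!cplusplus || multi_char) ? t.int_bits : t.char_bits;
}

// True if value is representable in the given number of bits.
bool fits(std::uint32_t value, unsigned bits) {
    return bits >= 32 || (value >> bits) == 0;
}
```

With these assumed widths, an ordinary '\U0010FFFD' gets 32 available bits in C, so 0x10FFFD fits and would be accepted unchanged, while the same literal in C++ gets only 8 bits and does not fit.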




