[cfe-dev] [PATCH] C++0x unicode string and character literals now with test cases

Eli Friedman eli.friedman at gmail.com
Sun Jul 31 13:20:10 PDT 2011


On Sun, Jul 31, 2011 at 1:03 PM, Seth Cantrell <seth.cantrell at gmail.com> wrote:
>>> So I've got a couple questions.
>>>
>>> Is the lexer really the appropriate place to be doing this? Originally CodeGenModule::GetStringForStringLiteral seemed like the thing I should be modifying, but I discovered that the string literal's bytes had already been zero-extended by the time it got there. Would it be reasonable for the StringLiteralParser to just produce a UTF-8 encoded internal representation of the string and leave producing the final representation until later? I think the main complication with that is that I'll have to encode UCNs with their UTF-8 representation.
>>
>> Given the possibility of character escapes which can't be represented
>> in UTF-8, I'm not sure we can...
>
> Yeah, I see that's correct now. I need a way to discriminate between "\xF0\x9F\x9A\x80" and U"\xF0\x9F\x9A\x80" as well.
>
> Perhaps instead the internal representation could be a discriminated union, based on the string literal's Kind or CharByteWidth?
>
> If the final representation does have to be computed inside the string literal parser, I'll need to get the target's endianness. I looked through the definition of the TargetInfo object the StringLiteralParser has, but didn't see a way to do this. Is this info accessible during this phase?
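
Right -- to make that discrimination point concrete (the escapes are
just the UTF-8 encoding of U+1F680), here's a minimal sketch of the
two readings:

    #include <cstdio>

    int main() {
      // Narrow literal: each \x escape produces one char element, so
      // the four escapes spell out the UTF-8 encoding of U+1F680 byte
      // by byte.
      const char n[] = "\xF0\x9F\x9A\x80";
      // UTF-32 literal: each \x escape produces one char32_t element,
      // so this is the four code points U+00F0 U+009F U+009A U+0080 --
      // it does not denote U+1F680 at all.
      const char32_t w[] = U"\xF0\x9F\x9A\x80";
      std::printf("%zu %zu\n", sizeof n, sizeof w); // prints "5 20"
      return 0;
    }

Since the escape text is identical in both literals, the parser has to
carry the literal's Kind (or CharByteWidth) alongside the bytes to
know which reading applies.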

A string is an array of CharByteWidth-sized integers; just keeping it
in the compiler's native endianness and letting the layers below clang
byteswap if necessary should be sufficient.  (Granted, IIRC clang
IRGen doesn't handle wide strings in a very intuitive manner at the
moment.)  Pretty sure there's some way to get endianness off the
TargetInfo if you really need it, though; at the very least, it's in
the target data layout string.
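
If you do end up wanting it, the data layout string makes endianness
easy to pick out: its '-'-separated components include a lone "e"
(little-endian) or "E" (big-endian).  A rough sketch -- the helper is
hypothetical, not an existing clang API:

    #include "llvm/ADT/SmallVector.h"
    #include "llvm/ADT/StringRef.h"

    // Scan the '-'-separated data layout components for the
    // endianness marker: "E" means big-endian, "e" little-endian.
    static bool layoutIsBigEndian(llvm::StringRef Layout) {
      llvm::SmallVector<llvm::StringRef, 16> Parts;
      Layout.split(Parts, "-");
      for (unsigned i = 0, e = Parts.size(); i != e; ++i)
        if (Parts[i] == "E")
          return true;
      return false; // no "E" component: treat as little-endian ("e")
    }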

>>> I assume eventually someone will want source and execution charset configuration, but for now I'm content to assume source is UTF-8 and that the execution character sets are UTF-8, UTF-16, and UTF-32, with the target's native endianness. Is that good enough for now?
>>
>> The C execution character set can't be UTF-16 or UTF-32 given 8-bit
>> chars.  But yes, feel free to assume the source and execution
>> charsets are UTF-8 for the moment.  (Windows is the only interesting
>> platform where this isn't the case normally.)
>
> Well, by execution charset I just meant the literal's representation at execution time, so there'd be an 'execution charset' for each string literal type. Perhaps this isn't the right terminology.

Oh, yes, that's fine.  That term (or at least very similar ones) has a
pretty specific definition in the C standard.
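
For the archives, here's how the per-literal-kind encodings line up,
using U+1F680 as the sample character again (the static_asserts assume
an 8-bit-char target):

    // Each literal kind fixes its own encoding and code unit width.
    const char     s8[]  = u8"\U0001F680"; // UTF-8:  F0 9F 9A 80
    const char16_t s16[] = u"\U0001F680";  // UTF-16: D83D DE80 (pair)
    const char32_t s32[] = U"\U0001F680";  // UTF-32: 0001F680

    static_assert(sizeof s8  == 5, "four UTF-8 code units plus NUL");
    static_assert(sizeof s16 == 6, "two UTF-16 code units plus NUL");
    static_assert(sizeof s32 == 8, "one UTF-32 code unit plus NUL");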

-Eli