[cfe-dev] [PATCH] C++0x unicode string and character literals now with test cases

Tue Aug 23 21:14:21 PDT 2011

Attached is a patch which allows UTF-16 and UTF-32 string literals to work as expected (i.e., the source string literal data is converted from the input encoding, currently always UTF-8, to UTF-16 or UTF-32, and can be accessed as such at runtime). The patch only changes how non-escaped data in string literals is handled; the handling of hex escape sequences and universal character names is unchanged.

Can someone take a look and let me know what needs changing to get this accepted? So far I expect I'll need to add tests:

- test all string types (u8, u8R, u, uR, U, UR, L, LR) with valid UTF-8 data and verify that the output object file contains the expected data
- test the new error using an ISO-8859-1 encoded file containing accented characters in a string literal
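
As a rough illustration of the conversion and of the error case in the second test, here is a minimal sketch of a strict UTF-8 decoder. It is not taken from the patch: `decodeUtf8` and its signature are hypothetical names chosen for this example, and the patch itself may well use different machinery for the conversion.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical helper, not part of the patch: decode UTF-8 bytes into
// UTF-32 code points. Returns false on malformed input; in that case
// `out` may hold the code points decoded before the error.
bool decodeUtf8(const std::string &bytes, std::vector<char32_t> &out) {
  // Smallest code point that legitimately needs a sequence of each length.
  static constexpr char32_t minForLen[] = {0, 0, 0x80, 0x800, 0x10000};

  std::size_t i = 0;
  while (i < bytes.size()) {
    const std::uint8_t lead = static_cast<std::uint8_t>(bytes[i]);
    char32_t cp;
    std::size_t len;
    if (lead < 0x80)                { cp = lead;        len = 1; }
    else if ((lead & 0xE0) == 0xC0) { cp = lead & 0x1F; len = 2; }
    else if ((lead & 0xF0) == 0xE0) { cp = lead & 0x0F; len = 3; }
    else if ((lead & 0xF8) == 0xF0) { cp = lead & 0x07; len = 4; }
    else return false;  // stray continuation byte or invalid lead byte

    if (bytes.size() - i < len) return false;  // truncated sequence
    for (std::size_t k = 1; k < len; ++k) {
      const std::uint8_t b = static_cast<std::uint8_t>(bytes[i + k]);
      if ((b & 0xC0) != 0x80) return false;  // not a continuation byte
      cp = (cp << 6) | (b & 0x3F);
    }

    // Reject overlong forms, surrogates, and values beyond U+10FFFF.
    if (cp < minForLen[len] || cp > 0x10FFFF ||
        (cp >= 0xD800 && cp <= 0xDFFF))
      return false;

    out.push_back(cp);
    i += len;
  }
  return true;
}
```

With this sketch, the bytes C3 A9 decode to U+00E9, while a lone ISO-8859-1 E9 byte is rejected because it looks like the start of an incomplete three-byte sequence.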

Can anyone recommend existing tests I can look to as examples for implementing these? What other tests should I have? What other changes to the code are needed?

Also, while working on my patch, I noticed the following warning:

> test.cpp:33:20: warning: character unicode escape sequence too long for its type
>     char16_t c[] = u"\U0001F47F";
>                    ^

I found that the resulting code behaves as expected (producing the appropriate UTF-16 surrogate pair in the array). Should there really be a warning here?
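
For reference, the expected surrogate pair can be worked out by hand. The sketch below is not from the patch or from clang; `encodeUtf16` is a hypothetical helper that simply applies the standard UTF-16 encoding rule to a single code point.

```cpp
#include <cassert>
#include <vector>

// Hypothetical helper: encode one Unicode scalar value as UTF-16 code units.
std::vector<char16_t> encodeUtf16(char32_t cp) {
  // Precondition: cp is a Unicode scalar value
  // (at most U+10FFFF and not itself a surrogate).
  assert(cp <= 0x10FFFF && !(cp >= 0xD800 && cp <= 0xDFFF));

  // Code points in the Basic Multilingual Plane fit in one code unit.
  if (cp < 0x10000)
    return {static_cast<char16_t>(cp)};

  // Otherwise split the 20-bit offset into a high and a low surrogate.
  const char32_t v = cp - 0x10000;
  return {static_cast<char16_t>(0xD800 + (v >> 10)),
          static_cast<char16_t>(0xDC00 + (v & 0x3FF))};
}
```

For U+1F47F the offset is 0xF47F, so the two code units are 0xD83D and 0xDC7F.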

Thanks,
Seth

On Jul 31, 2011, at 4:20 PM, Eli Friedman wrote:

> On Sun, Jul 31, 2011 at 1:03 PM, Seth Cantrell <seth.cantrell at gmail.com> wrote:
>>>> So I've got a couple questions.
>>>> 
>>>> Is the lexer really the appropriate place to be doing this? Originally CodeGenModule::GetStringForStringLiteral seemed like the thing I should be modifying, but I discovered that the string literal's bytes had already been zero extended by the time it got there. Would it be reasonable for the StringLiteralParser to just produce a UTF-8 encoded internal representation of the string and leave producing the final representation until later? I think the main complication with that is that I'll have to encode UCNs with their UTF-8 representation.
>>> 
>>> Given the possibility of character escapes which can't be represented
>>> in UTF-8, I'm not sure we can...
>> 
>> Yeah, I see that's correct now. I need a way to discriminate between "\xF0\x9F\x9A\x80" and U"\xF0\x9F\x9A\x80" as well.
>> 
>> Perhaps instead the internal representation could be a discriminated union, based on the string literal's Kind or CharByteWidth?
>> 
>> If the final representation does have to be computed inside the string literal parser I'll need to get the target's endianness. I looked through the definition for the TargetInfo object the StringLiteralParser has but didn't see a way to do this. Is this info accessible during this phase?
> 
> A string is an array of CharByteWidth-size integers; just keeping it
> in the native endianness of the compiler and letting the stuff below
> clang byteswap if necessary should be sufficient.  (Granted, IIRC
> clang IRGen doesn't really handle wide strings in a very intuitive
> manner at the moment.)  Pretty sure there's some way to get endianness
> off the TargetInfo if you really need it, though; at the very least,
> it's in the target data layout string.
> 
>>>> I assume eventually someone will want source and execution charset configuration, but for now I'm content to assume source is UTF-8 and that the execution character sets are UTF-8, UTF-16, and UTF-32, with the target's native endianness. Is that good enough for now?
>>> 
>>> The C execution character set can't be UTF-16 or UTF-32 given 8-bit
>>> char's.  But yes, feel free to assume the source and execution
>>> charsets are UTF-8 for the moment.  (Windows is the only interesting
>>> platform where this isn't the case normally.)
>> 
>> Well, by execution charset I just meant the literal's representation at execution time, so there'd be an 'execution charset' for each string literal type. Perhaps this isn't the right terminology.
> 
> Oh, yes, that's fine.  That term (or at least very similar ones) has
> a pretty specific definition in the C standard.
> 
> -Eli
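
The distinction raised in the quoted thread between "\xF0\x9F\x9A\x80" and U"\xF0\x9F\x9A\x80" can be illustrated with a few compile-time checks. This is only a sketch of the language rule as I understand it (each hex escape produces exactly one code unit of the literal's character type); the names `narrow`, `wide`, and `codeUnits` are made up for the example.

```cpp
#include <cstddef>

// Hypothetical helper: number of code units in a literal, excluding the
// terminating null.
template <typename T, std::size_t N>
constexpr std::size_t codeUnits(const T (&)[N]) {
  return N - 1;
}

// Narrow literal: four one-byte code units, which happen to be the
// UTF-8 encoding of U+1F680.
constexpr char narrow[] = "\xF0\x9F\x9A\x80";

// char32_t literal: four separate 32-bit code units with the values
// 0xF0, 0x9F, 0x9A, 0x80. This is NOT a single U+1F680 code point.
constexpr char32_t wide[] = U"\xF0\x9F\x9A\x80";

static_assert(codeUnits(narrow) == 4, "four char code units");
static_assert(codeUnits(wide) == 4, "four char32_t code units");
static_assert(static_cast<unsigned char>(narrow[0]) == 0xF0, "first byte");
static_assert(wide[0] == 0xF0 && wide[1] == 0x9F &&
              wide[2] == 0x9A && wide[3] == 0x80,
              "each escape is its own code unit");
```

Both literals contain four code units, so any shared internal representation has to record the code unit width (or the literal kind) alongside the data in order to tell them apart.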

-------------- next part --------------
A non-text attachment was scrubbed...
Name: 0001-Support-converting-string-literal-data-to-the-approp.patch
Type: application/octet-stream
Size: 12888 bytes
Desc: not available
URL: <http://lists.llvm.org/pipermail/cfe-dev/attachments/20110824/1c898266/attachment.obj>