[cfe-dev] [PATCH] C++0x unicode string and character literals now with test cases

Sun Jul 31 13:03:01 PDT 2011

On Jul 31, 2011, at 3:48 AM, Eli Friedman wrote:

> On Sat, Jul 30, 2011 at 11:47 PM, Seth Cantrell <seth.cantrell at gmail.com> wrote:
>> I'd like to make sure use cases such as the following work.
>> 
>>> #include <iostream>
>>> #include <string>
>>> #include <locale>
>>> #include <codecvt>
>>> 
>>> // we have to make the destructor public for wstring_convert
>>> template<typename I,typename E,typename S>
>>> class codecvt : public std::codecvt<I,E,S>
>>> {
>>> public:
>>>       ~codecvt() {}
>>> };
>>> 
>>> int main (void) {
>>>       std::wstring_convert<codecvt<char16_t,char,mbstate_t>,char16_t> convert;
>>>       std::cout << convert.to_bytes(u"ⅯⅯⅪ 🚀 円形パターン")
>>>               << std::endl;
>>> }
>> 
>> To that end I took a stab at it today and came up with the following.
>> 
>> So I've got a couple questions.
>> 
>> Is the lexer really the appropriate place to be doing this? Originally CodeGenModule::GetStringForStringLiteral seemed like the thing I should be modifying, but I discovered that the string literal's bytes had already been zero-extended by the time it got there. Would it be reasonable for the StringLiteralParser to just produce a UTF-8 encoded internal representation of the string and leave producing the final representation until later? I think the main complication with that is that I'd have to encode UCNs with their UTF-8 representation.
> 
> Given the possibility of character escapes which can't be represented
> in UTF-8, I'm not sure we can...

Yeah, I see that's correct now. I need a way to discriminate between "\xF0\x9F\x9A\x80" and U"\xF0\x9F\x9A\x80" as well.
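
To make the distinction concrete, here is a minimal sketch of how the two literals differ once parsed. The helper name codeUnits and the function compareLiterals are made up for illustration; the behaviour shown is what C++11 specifies for hex escapes, not anything specific to the patch.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical helper: number of code units in a literal, excluding the NUL.
template <typename CharT, std::size_t N>
constexpr std::size_t codeUnits(const CharT (&)[N]) { return N - 1; }

void compareLiterals() {
    // Four 8-bit code units; together they happen to be the UTF-8 bytes of U+1F680.
    const char narrow[] = "\xF0\x9F\x9A\x80";
    // Four 32-bit code units with the values 0xF0, 0x9F, 0x9A, 0x80.
    // Each hex escape sets one whole code unit; no UTF-8 decoding happens.
    const char32_t utf32[] = U"\xF0\x9F\x9A\x80";
    // One 32-bit code unit: the UCN names the code point itself.
    const char32_t rocket[] = U"\U0001F680";

    assert(codeUnits(narrow) == 4);
    assert(codeUnits(utf32) == 4);
    assert(utf32[0] == 0xF0 && utf32[3] == 0x80);
    assert(codeUnits(rocket) == 1 && rocket[0] == 0x1F680);
}
```

So a UTF-8-only internal representation would lose the information that the second literal contains four separate escapes rather than one encoded character.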

Perhaps instead the internal representation could be a discriminated union, based on the string literal's Kind or CharByteWidth?

If the final representation does have to be computed inside the string literal parser, I'll need to get the target's endianness. I looked through the definition for the TargetInfo object the StringLiteralParser has but didn't see a way to do this. Is this info accessible during this phase?
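
For reference, this is roughly what I mean by needing the endianness. It is only a sketch: storeUTF16 is a hypothetical helper, and the littleEndian flag stands in for whatever query the target actually provides, which I have not found yet.

```cpp
#include <cstdint>
#include <vector>

// Hypothetical helper, not part of Clang: lay out UTF-16 code units as bytes
// in the byte order requested by the caller.
std::vector<unsigned char> storeUTF16(const std::vector<std::uint16_t>& units,
                                      bool littleEndian) {
    std::vector<unsigned char> out;
    out.reserve(units.size() * 2);
    for (std::uint16_t u : units) {
        const unsigned char lo = static_cast<unsigned char>(u & 0xFF);
        const unsigned char hi = static_cast<unsigned char>(u >> 8);
        if (littleEndian) {
            out.push_back(lo);
            out.push_back(hi);
        } else {
            out.push_back(hi);
            out.push_back(lo);
        }
    }
    return out;
}
```

For example, the surrogate pair 0xD83D 0xDE80 (U+1F680) comes out as 3D D8 80 DE on a little-endian target and D8 3D DE 80 on a big-endian one.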

> 
>> If a string literal includes some invalid bytes, is the right thing to do to just use the Unicode replacement character (U+FFFD) and issue a warning? This would mean that every byte in a string could require four bytes in the internal representation, and it'd probably take a custom routine to do the Unicode encoding.
> 
> We probably want to issue an error if the encoding of the file isn't
> valid... it indicates the file is either messed up or isn't using the
> encoding we think it is.

Okay, that would be simpler and safer.
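
A check along these lines is what I have in mind for deciding when to issue the error. This is a standalone sketch with a made-up name (isValidUTF8), written without ConvertUTF.h so it can be read on its own; a real patch would presumably reuse the existing routines instead.

```cpp
#include <cstddef>

// Hypothetical standalone check: returns true only for well-formed UTF-8.
// Rejects stray continuation bytes, truncated sequences, overlong forms,
// encoded surrogates, and values above U+10FFFF.
bool isValidUTF8(const unsigned char* s, std::size_t len) {
    // Smallest code point that legitimately needs n continuation bytes.
    static const unsigned long minForExtra[] = {0x0, 0x80, 0x800, 0x10000};
    std::size_t i = 0;
    while (i < len) {
        const unsigned char b = s[i];
        if (b < 0x80) { ++i; continue; }  // ASCII

        std::size_t extra;   // number of continuation bytes expected
        unsigned long cp;    // code point being assembled
        if ((b & 0xE0) == 0xC0)      { extra = 1; cp = b & 0x1F; }
        else if ((b & 0xF0) == 0xE0) { extra = 2; cp = b & 0x0F; }
        else if ((b & 0xF8) == 0xF0) { extra = 3; cp = b & 0x07; }
        else return false;           // continuation byte or invalid lead byte

        if (len - i <= extra) return false;  // sequence runs past the end
        for (std::size_t k = 1; k <= extra; ++k) {
            const unsigned char c = s[i + k];
            if ((c & 0xC0) != 0x80) return false;
            cp = (cp << 6) | (c & 0x3F);
        }
        if (cp < minForExtra[extra]) return false;           // overlong form
        if (cp >= 0xD800 && cp <= 0xDFFF) return false;      // surrogate
        if (cp > 0x10FFFF) return false;                     // out of range
        i += extra + 1;
    }
    return true;
}
```

With this, the bytes F0 9F 9A 80 are accepted, while C0 80 (overlong), ED A0 80 (an encoded surrogate), and a truncated F0 9F are all rejected.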

> 
>> The patch uses a function for converting between encodings based on iconv because that's what I had lying around, but I don't think that's going to work for a real patch. Any recommendations as to what should be used instead?
> 
> include/clang/Basic/ConvertUTF.h .
> 
>> I assume eventually someone will want source and execution charset configuration, but for now I'm content to assume source is UTF-8 and that the execution character sets are UTF-8, UTF-16, and UTF-32, with the target's native endianness. Is that good enough for now?
> 
> The C execution character set can't be UTF-16 or UTF-32 given 8-bit
> chars.  But yes, feel free to assume the source and execution
> charsets are UTF-8 for the moment.  (Windows is the only interesting
> platform where this isn't the case normally.)

Well, by execution charset I just meant the literal's representation at execution time, so there'd be an 'execution charset' for each string literal type. Perhaps this isn't the right terminology.

> 
>> @@ -1001,6 +1049,15 @@ void StringLiteralParser::init(const Token *StringToks, unsigned NumStringToks){
>>         if (CharByteWidth == 1) {
>>           memcpy(ResultPtr, InStart, Len);
>>           ResultPtr += Len;
>> +        } else if(isUTF16()) {
>> +          ResultPtr += convert_to_UTF16(InStart,Len,ResultPtr);
>> +        } else if(isUTF32()) {
>> +          ResultPtr += convert_to_UTF32(InStart,Len,ResultPtr);
>> +        }
>> +        /*
>> +        else if(isWide()) {
>> +          ResultPtr += convert_to_ExecutionWideCharset(InStart,Len,ResultPtr);
>> +         */
>>         } else {
>>           // Note: our internal rep of wide char tokens is always little-endian.
>>           for (; Len; --Len, ++InStart) {
> 
> Perhaps something more like "else if (CharByteWidth == 2) [...]" etc.
> would be more appropriate?  We currently assume wchar_t is UTF-16 or
> UTF-32 anyway.
> 
> -Eli
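
As a rough sketch of dispatching on CharByteWidth rather than on the literal kind: the function below encodes one code point as UTF-8, UTF-16, or UTF-32 depending on the width. appendCodePoint is a hypothetical name, and it stores every code unit in a 32-bit slot purely to keep the example short; it is not how StringLiteralParser lays out its buffer.

```cpp
#include <cstdint>
#include <vector>

// Hypothetical helper: append one code point to `out` as code units of the
// given byte width (1 = UTF-8, 2 = UTF-16, 4 = UTF-32). Returns false for an
// unsupported width or an invalid code point, leaving `out` unchanged.
bool appendCodePoint(std::uint32_t cp, unsigned charByteWidth,
                     std::vector<std::uint32_t>& out) {
    if (cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF))
        return false;

    if (charByteWidth == 1) {
        if (cp < 0x80) {
            out.push_back(cp);
        } else if (cp < 0x800) {
            out.push_back(0xC0 | (cp >> 6));
            out.push_back(0x80 | (cp & 0x3F));
        } else if (cp < 0x10000) {
            out.push_back(0xE0 | (cp >> 12));
            out.push_back(0x80 | ((cp >> 6) & 0x3F));
            out.push_back(0x80 | (cp & 0x3F));
        } else {
            out.push_back(0xF0 | (cp >> 18));
            out.push_back(0x80 | ((cp >> 12) & 0x3F));
            out.push_back(0x80 | ((cp >> 6) & 0x3F));
            out.push_back(0x80 | (cp & 0x3F));
        }
    } else if (charByteWidth == 2) {
        if (cp < 0x10000) {
            out.push_back(cp);
        } else {
            const std::uint32_t v = cp - 0x10000;
            out.push_back(0xD800 | (v >> 10));    // high surrogate
            out.push_back(0xDC00 | (v & 0x3FF));  // low surrogate
        }
    } else if (charByteWidth == 4) {
        out.push_back(cp);
    } else {
        return false;
    }
    return true;
}
```

Keyed this way, a wide literal would take the 2-byte or 4-byte path according to the target's wchar_t width, with no separate isWide() branch.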