[cfe-dev] [PATCH] C++0x unicode string and character literals now with test cases

Eli Friedman eli.friedman at gmail.com
Sun Jul 31 00:48:02 PDT 2011


On Sat, Jul 30, 2011 at 11:47 PM, Seth Cantrell <seth.cantrell at gmail.com> wrote:
> I'd like to make sure use cases such as the following work.
>
>> #include <iostream>
>> #include <string>
>> #include <locale>
>> #include <codecvt>
>>
>> // we have to make the destructor public for wstring_convert
>> template<typename I,typename E,typename S>
>> class codecvt : public std::codecvt<I,E,S>
>> {
>> public:
>>       ~codecvt() {}
>> };
>>
>> int main (void) {
>>       std::wstring_convert<codecvt<char16_t,char,mbstate_t>,char16_t> convert;
>>       std::cout << convert.to_bytes(u"ⅯⅯⅪ 🚀 円形パターン")
>>               << std::endl;
>> }
>
> To that end I took a stab at it today and came up with the following.
>
>
>
>
> So I've got a couple questions.
>
> Is the lexer really the appropriate place to be doing this? Originally CodeGenModule::GetStringForStringLiteral seemed like the thing I should be modifying, but I discovered that the string literal's bytes had already been zero extended by the time it got there. Would it be reasonable for the StringLiteralParser to just produce a UTF-8 encoded internal representation of the string and leave producing the final representation until later? I think the main complication with that is that I'll have to encode UCNs with their UTF-8 representation.

Given the possibility of character escapes which can't be represented
in UTF-8, I'm not sure we can...

> If a string literal includes some invalid bytes is the right thing to do to just use the unicode replacement character (U+FFFD) and issue a warning? This would mean that every byte in a string could require four bytes in the internal representation, and it'd probably take a custom routine to do the Unicode encoding.

We probably want to issue an error if the encoding of the file isn't
valid... it indicates the file is either messed up or isn't using the
encoding we think it is.

> The patch uses a function for converting between encodings based on iconv because that's what I had lying around, but I don't think that's going to work for a real patch. Any recommendations as to what should be used instead?

include/clang/Basic/ConvertUTF.h .

> I assume eventually someone will want source and execution charset configuration, but for now I'm content to assume source is UTF-8 and that the execution character sets are UTF-8, UTF-16, and UTF-32, with the target's native endianness. Is that good enough for now?

The C execution character set can't be UTF-16 or UTF-32 given 8-bit
chars.  But yes, feel free to assume the source and execution
charsets are UTF-8 for the moment.  (Windows is the only interesting
platform where this isn't the case normally.)

@@ -1001,6 +1049,15 @@ void StringLiteralParser::init(const Token *StringToks, unsigned NumStringToks){
         if (CharByteWidth == 1) {
           memcpy(ResultPtr, InStart, Len);
           ResultPtr += Len;
+        } else if (isUTF16()) {
+          ResultPtr += convert_to_UTF16(InStart, Len, ResultPtr);
+        } else if (isUTF32()) {
+          ResultPtr += convert_to_UTF32(InStart, Len, ResultPtr);
+        /*
+        } else if (isWide()) {
+          ResultPtr += convert_to_ExecutionWideCharset(InStart, Len, ResultPtr);
+        */
         } else {
          // Note: our internal rep of wide char tokens is always little-endian.
           for (; Len; --Len, ++InStart) {

Perhaps something more like "else if (CharByteWidth == 2) [...]" etc.
would be more appropriate?  We currently assume wchar_t is UTF-16 or
UTF-32 anyway.

-Eli




More information about the cfe-dev mailing list