[cfe-dev] [PATCH] C++0x unicode string and character literals now with test cases

Wed Aug 31 21:53:21 PDT 2011

On Wed, Aug 31, 2011 at 9:14 PM, Seth Cantrell <seth.cantrell at gmail.com> wrote:
>
> On Aug 29, 2011, at 2:50 PM, Eli Friedman wrote:
>
>> On Sun, Aug 28, 2011 at 6:26 PM, Seth Cantrell <seth.cantrell at gmail.com> wrote:
>>> Here's a set of patches updated as per your comments.
>>
>> @@ -1119,20 +1133,37 @@ void StringLiteralParser::init(const Token
>> *StringToks, unsigned NumStringToks){
>>
>> /// copyStringFragment - This function copies from Start to End into ResultPtr.
>> /// Performs widening for multi-byte characters.
>> -void StringLiteralParser::CopyStringFragment(StringRef Fragment) {
>> +bool StringLiteralParser::CopyStringFragment(StringRef Fragment) {
>> +  assert(CharByteWidth==1 || CharByteWidth==2 || CharByteWidth==4);
>> +  ConversionResult result = conversionOK;
>>   // Copy the character span over.
>>   if (CharByteWidth == 1) {
>>     memcpy(ResultPtr, Fragment.data(), Fragment.size());
>>     ResultPtr += Fragment.size();
>> -  } else {
>> -    // Note: our internal rep of wide char tokens is always little-endian.
>> -    for (StringRef::iterator I=Fragment.begin(), E=Fragment.end(); I!=E; ++I) {
>> -      *ResultPtr++ = *I;
>> -      // Add zeros at the end.
>> -      for (unsigned i = 1, e = CharByteWidth; i != e; ++i)
>> -        *ResultPtr++ = 0;
>> -    }
>> +  } else if (CharByteWidth == 2) {
>> +    UTF8 const *sourceStart = (UTF8 const *)Fragment.data();
>> +    // Is ResultPtr properly aligned? are we violating aliasing rules?
>> +    UTF16 *targetStart = reinterpret_cast<UTF16*>(ResultPtr);
>>
>> Err, oww, sorry about not noticing this earlier...this breaks
>> big-endian hosts.  At least for now, please use a local buffer and do
>> a copy which preserves the current representation.  (It would be nice
>> to fix strings to use native-endian arrays in a subsequent patch, but
>> that's a bit more involved.)
>
> It appears to me that it's already messed up on big endian platforms. With git revision f84139a1331c6:
>
> Source tmp.cpp:
> char32_t c[] = U"abc";
>
> Command:
> Release+Asserts/bin/clang++ -std=c++0x -stdlib=libc++ -c tmp.cpp -arch ppc -emit-llvm -O0 -o - | llvm-dis
>
> Output:
> ; ModuleID = '<stdin>'
> target datalayout = "E-p:32:32:32-i1:8:8-i8:8:8-i16:16:16-i32:32:32-i64:64:64-f32:32:32-f64:64:64-v128:128:128-n32"
> target triple = "powerpc-apple-darwin11.1.0"
>
> @c = global [16 x i8] c"a\00\00\00b\00\00\00c\00\00\00\00\00\00\00", align 4
>
> As you note the current code always does little endian for wide characters, but it appears that this isn't taken into account during codegen in ActOnStringLiteral. I think my new code would actually work around the problem when building for a big-endian target if the host is big-endian (though I haven't actually tested it).
>
> I'll go ahead and change my code to maintain the current little-endian internal representation for now; I may be missing something and its probably better not to disturb the behavior until the other issues are also fixed anyway.

Okay.

> I would like to have the internal representation be native-endian arrays as well. Can you point out to me what you think the issues will be?

I'm not really expecting difficulties, just making sure everything
uses the correct types to access the string where appropriate.
Actually, there isn't much that actually examines the contents of the
string between the string literal parser and IRGen; I guess there's
the printf checker and a couple of other things...

> Also I notice that the constant generated for:
>
> char32_t c[] = {U'a',U'b',U'c',U'\0'};
>
> is
>
> @c = global [4 x i32] [i32 97, i32 98, i32 99, i32 0], align 4
>
> It seems to me that this is preferable to the [16 x i8] definition even for string literals. Would this be acceptable? Would it be a large change or one I can make just in ActOnStringLiteral?

Yes; that's part of the change to use native-endian arrays, I think...
do everything consistently, and nothing has to worry about endianness
on the strings until the object file is written.

>>
>> +// RUN: %clang_cc1 -std=c++0x -fsyntax-only -verify %s
>> +
>> +void f() {
>> +    wchar_t const *a = L"�����"; // expected-error {{ illegal
>> sequence in string literal }}
>> +
>> +    char16_t const *b = u"�����"; // expected-error {{ illegal
>> sequence in string literal }}
>> +    char32_t const *c = U"�����"; // expected-error {{ illegal
>> sequence in string literal }}
>> +    wchar_t const *d = LR"(�����)"; // expected-error {{ illegal
>> sequence in string literal }}
>> +    char16_t const *e = uR"(�����)"; // expected-error {{ illegal
>> sequence in string literal }}
>> +    char32_t const *f = UR"(�����)"; // expected-error {{ illegal
>> sequence in string literal }}
>> +}
>> \ No newline at end of file
>> --
>>
>> Please put a newline at the end of the file.  Please, please put in a
>> comment that people should be very careful when they edit this file,
>> in case their text editor "fixes" the bad encoding.
>
> Thanks, I'll do that. Although this file is consistently ISO-8859-1 and shouldn't cause much of a problem. The other test file, however, is mixed ISO-8859-1 and UTF-8, so it could be confusing. Xcode guesses at its encoding and comes up with Mac Roman. (although it's interesting that gitk shows all the string literals in their correct encoding. Apparently it guesses encodings on a line by line basis.)

That's... interesting.

-Eli