[cfe-dev] [PATCH] C++0x unicode string and character literals now with test cases

Eli Friedman eli.friedman at gmail.com
Mon Oct 24 19:13:40 PDT 2011

On Mon, Oct 24, 2011 at 6:43 PM, Seth Cantrell <seth.cantrell at gmail.com> wrote:
> On Oct 24, 2011, at 1:14 AM, Eli Friedman wrote:
>> On Sun, Oct 23, 2011 at 8:51 PM, Seth Cantrell <seth.cantrell at gmail.com> wrote:
>>> Here's a new set of patches for review. In addition to the previous changes in string literal encoding, this set changes the way non-char-based strings are represented in LLVM IR: they are now natively arrays of i16 or i32, matching the size of the character element used in the string constant.
>>> As per the previous comments, I fixed the newline at the end of the new test files and added comments on the test files' encodings. Patch 3 is the one dealing with changing the representation of string constants to use arrays with elements of the right size for wchar_t, char16_t, and char32_t strings. Patch 5 modifies preexisting tests that expect the old representation.
>>> There's still one test failure that I'm not sure about. CodeGen/global-init.c expects a line like:
>>> @l = global { [24 x i8], i32 } { [24 x i8] c"f\00\00\00o\00\00\00o\00\00\00\00\00\00\00\00\00\00\00\00\00\00\00", i32 1 }
>>> and building with these patches produces a different result:
>>> %struct.K = type { [6 x i32], i32 }
>>> ...
>>> @l = global %struct.K { [6 x i32] [i32 102, i32 111, i32 111, i32 0, i32 65, i32 0], i32 1 }, align 4
>>> This includes a change beyond just the string representation: the struct type K is now created and referenced instead of being defined inline. I don't think this is related to any changes I made, but I don't want to just update the test with the output I get without being certain that it's correct.
>> That's fine: the reason you're seeing the change is that where
>> possible, we try to generate globals with an appropriate type.  Given
>> the old representation, that was not possible for this case; with the
>> new representation, it is.
>>> Let me know what you think.
>> diff --git a/lib/Serialization/ASTReaderStmt.cpp
>> b/lib/Serialization/ASTReaderStmt.cpp
>> index 85d0f92..aeb40fb 100644
>> --- a/lib/Serialization/ASTReaderStmt.cpp
>> +++ b/lib/Serialization/ASTReaderStmt.cpp
>> @@ -372,12 +372,13 @@ void ASTStmtReader::VisitStringLiteral(StringLiteral *E) {
>>   assert(Record[Idx] == E->getNumConcatenated() &&
>>          "Wrong number of concatenated tokens!");
>>   ++Idx;
>> -  E->Kind = static_cast<StringLiteral::StringKind>(Record[Idx++]);
>> -  E->IsPascal = Record[Idx++];
>> +  StringLiteral::StringKind kind =
>> +        static_cast<StringLiteral::StringKind>(Record[Idx++]);
>> +  bool isPascal = Record[Idx++];
>>   // Read string data
>>   llvm::SmallString<16> Str(&Record[Idx], &Record[Idx] + Len);
>> -  E->setString(Reader.getContext(), Str.str());
>> +  E->setString(Reader.getContext(), Str.str(),kind, isPascal);
>>   Idx += Len;
>> Whitespace; also, you might want to add a FIXME about how we read in
>> the string.  (Ideally, the format of serialized AST files should not
>> vary across platforms.)
> I'm not sure that's possible in this case.
> Consider the literals L"\xD83D\xDC7F" and L"\U0001F47F" (The former being the UTF-16 surrogate pair corresponding to the latter). For a Windows target these two literals are indistinguishable after we do the initial translation in Sema::ActOnStringLiteral. The resulting StringLiteral will store the same data, and have the same Kind and CharByteWidth. But on a platform with sizeof(wchar_t)==4 the two will not be the same.

That isn't what I meant; it's okay for the serialized AST to vary
across targets.  It just shouldn't vary across hosts.

