[cfe-dev] [PATCH] C++0x unicode string and character literals now with test cases

Sun Oct 23 20:51:20 PDT 2011

Here's a new set of patches for review. In addition to the previous changes in string literal encoding, this set changes the way non-char based strings are represented in LLVM IR. Now non-char based strings are natively arrays of i32 and i16 based on the size of the character element used in the string constant.

-------------- next part --------------
A non-text attachment was scrubbed...
Name: 0001-enable-use-of-ConvertUTF8toUTF32.patch
Type: application/octet-stream
Size: 7147 bytes
Desc: not available
URL: <http://lists.llvm.org/pipermail/cfe-dev/attachments/20111023/de596dac/attachment.obj>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: 0002-Convert-string-literal-data-from-source-to-target-en.patch
Type: application/octet-stream
Size: 7403 bytes
Desc: not available
URL: <http://lists.llvm.org/pipermail/cfe-dev/attachments/20111023/de596dac/attachment-0001.obj>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: 0003-Represent-string-literals-as-arrays-of-elements-of-t.patch
Type: application/octet-stream
Size: 16461 bytes
Desc: not available
URL: <http://lists.llvm.org/pipermail/cfe-dev/attachments/20111023/de596dac/attachment-0002.obj>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: 0004-create-tests-for-unicode-string-literal-conversions.patch
Type: application/octet-stream
Size: 4853 bytes
Desc: not available
URL: <http://lists.llvm.org/pipermail/cfe-dev/attachments/20111023/de596dac/attachment-0003.obj>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: 0005-update-tests-to-account-for-new-string-constant-repr.patch
Type: application/octet-stream
Size: 7155 bytes
Desc: not available
URL: <http://lists.llvm.org/pipermail/cfe-dev/attachments/20111023/de596dac/attachment-0004.obj>
-------------- next part --------------

As per the previous comments, I fixed the newline at the end of the new test files and added comments on the test files' encodings. Patch 3 is the one dealing with changing the representation of string constants to use arrays with elements of the right size for wchar_t, char16_t, and char32_t strings. Patch 5 modifies preexisting tests that expect the old representation.

There's still one test failure that I'm not sure about. CodeGen/global-init.c expects a line like:

@l = global { [24 x i8], i32 } { [24 x i8] c"f\00\00\00o\00\00\00o\00\00\00\00\00\00\00\00\00\00\00\00\00\00\00", i32 1 }

and building with these patches produce a different result: 

%struct.K = type { [6 x i32], i32 }
...
@l = global %struct.K { [6 x i32] [i32 102, i32 111, i32 111, i32 0, i32 65, i32 0], i32 1 }, align 4

This includes a change other than just the string representation. Now the struct type K is created and referenced instead of being defined inline. I don't think this is related to any changes I made, but I don't what to just update the test with the output I get without being certain that it's correct.

Let me know what you think.

Thanks,
- Seth

On Sep 1, 2011, at 12:53 AM, Eli Friedman wrote:

> On Wed, Aug 31, 2011 at 9:14 PM, Seth Cantrell <seth.cantrell at gmail.com> wrote:
>> 
>> On Aug 29, 2011, at 2:50 PM, Eli Friedman wrote:
>> 
>>> On Sun, Aug 28, 2011 at 6:26 PM, Seth Cantrell <seth.cantrell at gmail.com> wrote:
>>>> Here's a set of patches updated as per your comments.
>>> 
>>> @@ -1119,20 +1133,37 @@ void StringLiteralParser::init(const Token
>>> *StringToks, unsigned NumStringToks){
>>> 
>>> /// copyStringFragment - This function copies from Start to End into ResultPtr.
>>> /// Performs widening for multi-byte characters.
>>> -void StringLiteralParser::CopyStringFragment(StringRef Fragment) {
>>> +bool StringLiteralParser::CopyStringFragment(StringRef Fragment) {
>>> +  assert(CharByteWidth==1 || CharByteWidth==2 || CharByteWidth==4);
>>> +  ConversionResult result = conversionOK;
>>>   // Copy the character span over.
>>>   if (CharByteWidth == 1) {
>>>     memcpy(ResultPtr, Fragment.data(), Fragment.size());
>>>     ResultPtr += Fragment.size();
>>> -  } else {
>>> -    // Note: our internal rep of wide char tokens is always little-endian.
>>> -    for (StringRef::iterator I=Fragment.begin(), E=Fragment.end(); I!=E; ++I) {
>>> -      *ResultPtr++ = *I;
>>> -      // Add zeros at the end.
>>> -      for (unsigned i = 1, e = CharByteWidth; i != e; ++i)
>>> -        *ResultPtr++ = 0;
>>> -    }
>>> +  } else if (CharByteWidth == 2) {
>>> +    UTF8 const *sourceStart = (UTF8 const *)Fragment.data();
>>> +    // Is ResultPtr properly aligned? are we violating aliasing rules?
>>> +    UTF16 *targetStart = reinterpret_cast<UTF16*>(ResultPtr);
>>> 
>>> Err, oww, sorry about not noticing this earlier...this breaks
>>> big-endian hosts.  At least for now, please use a local buffer and do
>>> a copy which preserves the current representation.  (It would be nice
>>> to fix strings to use native-endian arrays in a subsequent patch, but
>>> that's a bit more involved.)
>> 
>> It appears to me that it's already messed up on big endian platforms. With git revision f84139a1331c6:
>> 
>> Source tmp.cpp:
>> char32_t c[] = U"abc";
>> 
>> Command:
>> Release+Asserts/bin/clang++ -std=c++0x -stdlib=libc++ -c tmp.cpp -arch ppc -emit-llvm -O0 -o - | llvm-dis
>> 
>> Output:
>> ; ModuleID = '<stdin>'
>> target datalayout = "E-p:32:32:32-i1:8:8-i8:8:8-i16:16:16-i32:32:32-i64:64:64-f32:32:32-f64:64:64-v128:128:128-n32"
>> target triple = "powerpc-apple-darwin11.1.0"
>> 
>> @c = global [16 x i8] c"a\00\00\00b\00\00\00c\00\00\00\00\00\00\00", align 4
>> 
>> As you note the current code always does little endian for wide characters, but it appears that this isn't taken into account during codegen in ActOnStringLiteral. I think my new code would actually work around the problem when building for a big-endian target if the host is big-endian (though I haven't actually tested it).
>> 
>> I'll go ahead and change my code to maintain the current little-endian internal representation for now; I may be missing something and its probably better not to disturb the behavior until the other issues are also fixed anyway.
> 
> Okay.
> 
>> I would like to have the internal representation be native-endian arrays as well. Can you point out to me what you think the issues will be?
> 
> I'm not really expecting difficulties, just making sure everything
> uses the correct types to access the string where appropriate.
> Actually, there isn't much that actually examines the contents of the
> string between the string literal parser and IRGen; I guess there's
> the printf checker and a couple of other things...
> 
>> Also I notice that the constant generated for:
>> 
>> char32_t c[] = {U'a',U'b',U'c',U'\0'};
>> 
>> is
>> 
>> @c = global [4 x i32] [i32 97, i32 98, i32 99, i32 0], align 4
>> 
>> It seems to me that this is preferable to the [16 x i8] definition even for string literals. Would this be acceptable? Would it be a large change or one I can make just in ActOnStringLiteral?
> 
> Yes; that's part of the change to use native-endian arrays, I think...
> do everything consistently, and nothing has to worry about endianness
> on the strings until the object file is written.
> 
>>> 
>>> +// RUN: %clang_cc1 -std=c++0x -fsyntax-only -verify %s
>>> +
>>> +void f() {
>>> +    wchar_t const *a = L"�����"; // expected-error {{ illegal
>>> sequence in string literal }}
>>> +
>>> +    char16_t const *b = u"�����"; // expected-error {{ illegal
>>> sequence in string literal }}
>>> +    char32_t const *c = U"�����"; // expected-error {{ illegal
>>> sequence in string literal }}
>>> +    wchar_t const *d = LR"(�����)"; // expected-error {{ illegal
>>> sequence in string literal }}
>>> +    char16_t const *e = uR"(�����)"; // expected-error {{ illegal
>>> sequence in string literal }}
>>> +    char32_t const *f = UR"(�����)"; // expected-error {{ illegal
>>> sequence in string literal }}
>>> +}
>>> \ No newline at end of file
>>> --
>>> 
>>> Please put a newline at the end of the file.  Please, please put in a
>>> comment that people should be very careful when they edit this file,
>>> in case their text editor "fixes" the bad encoding.
>> 
>> Thanks, I'll do that. Although this file is consistently ISO-8859-1 and shouldn't cause much of a problem. The other test file, however, is mixed ISO-8859-1 and UTF-8, so it could be confusing. Xcode guesses at its encoding and comes up with Mac Roman. (although it's interesting that gitk shows all the string literals in their correct encoding. Apparently it guesses encodings on a line by line basis.)
> 
> That's... interesting.
> 
> -Eli