[cfe-dev] [PATCH] C++0x unicode string and character literals now with test cases

Seth Cantrell seth.cantrell at gmail.com
Wed Aug 31 21:14:21 PDT 2011


On Aug 29, 2011, at 2:50 PM, Eli Friedman wrote:

> On Sun, Aug 28, 2011 at 6:26 PM, Seth Cantrell <seth.cantrell at gmail.com> wrote:
>> Here's a set of patches updated as per your comments.
> 
> @@ -1119,20 +1133,37 @@ void StringLiteralParser::init(const Token *StringToks, unsigned NumStringToks){
> 
> /// copyStringFragment - This function copies from Start to End into ResultPtr.
> /// Performs widening for multi-byte characters.
> -void StringLiteralParser::CopyStringFragment(StringRef Fragment) {
> +bool StringLiteralParser::CopyStringFragment(StringRef Fragment) {
> +  assert(CharByteWidth==1 || CharByteWidth==2 || CharByteWidth==4);
> +  ConversionResult result = conversionOK;
>   // Copy the character span over.
>   if (CharByteWidth == 1) {
>     memcpy(ResultPtr, Fragment.data(), Fragment.size());
>     ResultPtr += Fragment.size();
> -  } else {
> -    // Note: our internal rep of wide char tokens is always little-endian.
> -    for (StringRef::iterator I=Fragment.begin(), E=Fragment.end(); I!=E; ++I) {
> -      *ResultPtr++ = *I;
> -      // Add zeros at the end.
> -      for (unsigned i = 1, e = CharByteWidth; i != e; ++i)
> -        *ResultPtr++ = 0;
> -    }
> +  } else if (CharByteWidth == 2) {
> +    UTF8 const *sourceStart = (UTF8 const *)Fragment.data();
> +    // Is ResultPtr properly aligned? are we violating aliasing rules?
> +    UTF16 *targetStart = reinterpret_cast<UTF16*>(ResultPtr);
> 
> Err, oww, sorry about not noticing this earlier...this breaks
> big-endian hosts.  At least for now, please use a local buffer and do
> a copy which preserves the current representation.  (It would be nice
> to fix strings to use native-endian arrays in a subsequent patch, but
> that's a bit more involved.)

It appears to me that it's already messed up on big-endian platforms. With git revision f84139a1331c6:

Source tmp.cpp:
char32_t c[] = U"abc";

Command:
Release+Asserts/bin/clang++ -std=c++0x -stdlib=libc++ -c tmp.cpp -arch ppc -emit-llvm -O0 -o - | llvm-dis

Output:
; ModuleID = '<stdin>'
target datalayout = "E-p:32:32:32-i1:8:8-i8:8:8-i16:16:16-i32:32:32-i64:64:64-f32:32:32-f64:64:64-v128:128:128-n32"
target triple = "powerpc-apple-darwin11.1.0"

@c = global [16 x i8] c"a\00\00\00b\00\00\00c\00\00\00\00\00\00\00", align 4

As you note, the current code always uses little-endian for wide characters, but it appears that this isn't taken into account during codegen in ActOnStringLiteral. I think my new code would actually work around the problem when building for a big-endian target on a big-endian host (though I haven't actually tested it).

I'll go ahead and change my code to maintain the current little-endian internal representation for now; I may be missing something, and it's probably better not to disturb the behavior until the other issues are fixed anyway.
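
For illustration, here's roughly how the local-buffer approach could look (an untested sketch, not the actual patch; it assumes the ConvertUTF.h interface the patch already uses, shows only the UTF-16 case, and the UTF-32 case would be analogous):

// Sketch only: convert into a local buffer of UTF-16 code units, then
// store the bytes low-byte-first so the internal representation stays
// little-endian regardless of host byte order.
bool StringLiteralParser::CopyStringFragment(StringRef Fragment) {
  if (CharByteWidth == 1) {
    memcpy(ResultPtr, Fragment.data(), Fragment.size());
    ResultPtr += Fragment.size();
    return false; // no error
  }

  assert(CharByteWidth == 2 && "UTF-32 case omitted from this sketch");

  const UTF8 *SourceStart = reinterpret_cast<const UTF8*>(Fragment.begin());
  const UTF8 *SourceEnd   = reinterpret_cast<const UTF8*>(Fragment.end());

  // Worst case is one UTF-16 code unit per input byte, so this is big enough.
  llvm::SmallVector<UTF16, 128> Units(Fragment.size());
  UTF16 *TargetStart = Units.begin();
  ConversionResult Result =
      ConvertUTF8toUTF16(&SourceStart, SourceEnd,
                         &TargetStart, Units.end(), strictConversion);
  if (Result != conversionOK)
    return true; // caller reports err_bad_string_encoding

  // Emit each converted code unit low byte first.
  for (const UTF16 *I = Units.begin(); I != TargetStart; ++I) {
    *ResultPtr++ = char(*I & 0xFF);
    *ResultPtr++ = char((*I >> 8) & 0xFF);
  }
  return false;
}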

I would like to have the internal representation be native-endian arrays as well. Can you point out to me what you think the issues will be?

Also I notice that the constant generated for:

char32_t c[] = {U'a',U'b',U'c',U'\0'};

is

@c = global [4 x i32] [i32 97, i32 98, i32 99, i32 0], align 4

It seems to me that this is preferable to the [16 x i8] definition even for string literals. Would this be acceptable? Would it be a large change or one I can make just in ActOnStringLiteral?
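
To make the comparison concrete, here's a rough illustration of building that initializer as an array of i32 elements through the LLVM constant API (a hypothetical sketch only, not the code currently in ActOnStringLiteral or CodeGen; header paths are the ones used in newer LLVM trees):

#include "llvm/ADT/SmallVector.h"
#include "llvm/IR/Constants.h"
#include "llvm/IR/DerivedTypes.h"
#include "llvm/IR/LLVMContext.h"

// Hypothetical sketch: build the char32_t string constant as [4 x i32]
// from its code units rather than as a [16 x i8] blob of raw bytes.
llvm::Constant *makeU32StringConstant(llvm::LLVMContext &Ctx) {
  llvm::Type *Int32Ty = llvm::Type::getInt32Ty(Ctx);
  llvm::ArrayType *ArrTy = llvm::ArrayType::get(Int32Ty, 4);

  // Code units of U"abc" plus the terminating null.
  const unsigned CodeUnits[] = { 'a', 'b', 'c', 0 };
  llvm::SmallVector<llvm::Constant*, 4> Elts;
  for (unsigned i = 0; i != 4; ++i)
    Elts.push_back(llvm::ConstantInt::get(Int32Ty, CodeUnits[i]));

  // Yields: [4 x i32] [i32 97, i32 98, i32 99, i32 0], with no byte-order
  // assumptions baked into the initializer.
  return llvm::ConstantArray::get(ArrTy, Elts);
}

The point is just that the element width and values are explicit, so the constant doesn't depend on the target's byte order the way the [16 x i8] form does.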

> 
> +// RUN: %clang_cc1 -std=c++0x -fsyntax-only -verify %s
> +
> +void f() {
> +    wchar_t const *a = L"�����"; // expected-error {{ illegal sequence in string literal }}
> +
> +    char16_t const *b = u"�����"; // expected-error {{ illegal sequence in string literal }}
> +    char32_t const *c = U"�����"; // expected-error {{ illegal sequence in string literal }}
> +    wchar_t const *d = LR"(�����)"; // expected-error {{ illegal sequence in string literal }}
> +    char16_t const *e = uR"(�����)"; // expected-error {{ illegal sequence in string literal }}
> +    char32_t const *f = UR"(�����)"; // expected-error {{ illegal sequence in string literal }}
> +}
> \ No newline at end of file
> -- 
> 
> Please put a newline at the end of the file.  Please, please put in a
> comment that people should be very careful when they edit this file,
> in case their text editor "fixes" the bad encoding.

Thanks, I'll do that, although this file is consistently ISO-8859-1 and shouldn't cause much of a problem. The other test file, however, mixes ISO-8859-1 and UTF-8, so it could be confusing. Xcode guesses at its encoding and comes up with Mac Roman. (It's interesting, though, that gitk shows all the string literals in their correct encodings; apparently it guesses the encoding on a line-by-line basis.)

> 
> -Eli
> 
>> On Aug 24, 2011, at 3:08 PM, Eli Friedman wrote:
>> 
>>> On Tue, Aug 23, 2011 at 9:14 PM, Seth Cantrell <seth.cantrell at gmail.com> wrote:
>>>> 
>>>> Attached is a patch which allows UTF-16 and UTF-32 string literals to work as expected (i.e., the source string literal data is converted from the input encoding (currently always UTF-8) to UTF-16 and UTF-32, and can be accessed as such at runtime). The patch only changes how non-escaped data in string literals is handled; hex escape sequence and universal character name handling isn't changed.
>>>> 
>>>> Can someone take a look and let me know what needs changing to get this accepted? So far I expect I'll need to add tests:
>>>> 
>>>> - test all string types (u8, u8R, u, uR, U, UR, L, LR) with valid UTF-8 data, verify that the output object file contains the expected data
>>>> - test the new error using an ISO-8859-1 encoded file containing accented characters in string literal
>>>> 
>>>> Can anyone recommend existing tests I can look to as examples for implementing these tests? What other tests should I have? What other changes to the code are needed?
>>> 
>>> @@ -1036,7 +1037,13 @@ void StringLiteralParser::init(const Token *StringToks, unsigned NumStringToks){
>>>         ThisTokEnd -= (ThisTokBuf - Prefix);
>>> 
>>>       // Copy the string over
>>> -      CopyStringFragment(StringRef(ThisTokBuf, ThisTokEnd - ThisTokBuf));
>>> +      if (CopyStringFragment(StringRef(ThisTokBuf,ThisTokEnd-ThisTokBuf))
>>> +          && Diags)
>>> +      {
>>> +        Diags->Report(FullSourceLoc(StringToks[i].getLocation(), SM),
>>> +                      diag::err_bad_string_encoding);
>>> +      }
>>> 
>>> You should set hadError = true here.
>>> 
>>> Besides that, I don't see any issues with the code.  Please split the
>>> ConvertUTF changes into a separate patch, though.
>>> 
>>> In terms of tests, test/Lexer/string_concat.cpp might be a good
>>> example to look at for the invalid cases.  For the valid cases, take a
>>> look at test/CodeGen/string-literal.c .
>>> 
>>>> Also while working on my patch I noticed the following warning:
>>>> 
>>>>> test.cpp:33:20: warning: character unicode escape sequence too long for its type
>>>>>     char16_t c[] = u"\U0001F47F";
>>>>>                    ^
>>>> 
>>>> I found that the resulting code behaves as expected (producing the appropriate UTF-16 surrogate pair in the array). Should there really be a warning here?
>>> 
>>> No; I'm pretty sure the warning is meant to catch L"\U0001F47F" in
>>> -fshort-wchar mode.
>>> 
>>> -Eli
>> 




