[cfe-dev] [PATCH] C++0x unicode string and character literals now with test cases

Wed Aug 24 12:08:29 PDT 2011

On Tue, Aug 23, 2011 at 9:14 PM, Seth Cantrell <seth.cantrell at gmail.com> wrote:
>
> Attached is a patch which allows UTF-16 and UTF-32 string literals to work as expected (i.e., the source string literal data is converted from the input encoding (currently always UTF-8) to UTF-16 and UTF-32, and can be accessed as such at runtime). The patch only changes how non-escaped data in string literals is handled, hex escape sequence and universal character name handling isn't changed.
>
> Can someone take a look and let me know what needs changing to get this accepted? So far I expect I'll need to add tests:
>
> - test all string types (u8, u8R, u, uR, U, UR, L, LR) with valid UTF-8 data, verify that the output object file contains the expected data
> - test the new error using an ISO-8859-1 encoded file containing accented characters in string literal
>
> Can anyone recommend existing tests I can look to for examples for implementing these tests? What other tests I should have? What other changes to the code are needed?

@@ -1036,7 +1037,13 @@ void StringLiteralParser::init(const Token
*StringToks, unsigned NumStringToks){
         ThisTokEnd -= (ThisTokBuf - Prefix);

       // Copy the string over
-      CopyStringFragment(StringRef(ThisTokBuf, ThisTokEnd - ThisTokBuf));
+      if (CopyStringFragment(StringRef(ThisTokBuf,ThisTokEnd-ThisTokBuf))
+          && Diags)
+      {
+        Diags->Report(FullSourceLoc(StringToks[i].getLocation(), SM),
+                      diag::err_bad_string_encoding);
+      }

You should set hadError = true here.

Besides that, I don't see any issues with the code.  Please split the
ConvertUTF changes into a separate patch, though.

In terms of tests, test/Lexer/string_concat.cpp might be a good
example to look at for the invalid cases.  For the valid cases, take a
look at test/CodeGen/string-literal.c .

> Also while working on my patch I noticed the following warning:
>
>> test.cpp:33:20: warning: character unicode escape sequence too long for its type
>>     char16_t c[] = u"\U0001F47F";
>>                    ^
>
> I found that the resulting code behaves as expected (producing the appropriate UTF-16 surrogate pair in the array). Should there really be a warning here?

No; I'm pretty sure the warning is meant to catch L"\U0001F47F" in
-fshort-wchar mode.

-Eli