[cfe-dev] [PATCH] C++0x unicode string and character literals now with test cases

Wed Aug 24 17:45:47 PDT 2011

On Aug 24, 2011, at 3:08 PM, Eli Friedman wrote:

> On Tue, Aug 23, 2011 at 9:14 PM, Seth Cantrell <seth.cantrell at gmail.com> wrote:
>> 
>> Attached is a patch which allows UTF-16 and UTF-32 string literals to work as expected (i.e., the source string literal data is converted from the input encoding (currently always UTF-8) to UTF-16 and UTF-32, and can be accessed as such at runtime). The patch only changes how non-escaped data in string literals is handled, hex escape sequence and universal character name handling isn't changed.
>> 
>> Can someone take a look and let me know what needs changing to get this accepted? So far I expect I'll need to add tests:
>> 
>> - test all string types (u8, u8R, u, uR, U, UR, L, LR) with valid UTF-8 data, verify that the output object file contains the expected data
>> - test the new error using an ISO-8859-1 encoded file containing accented characters in string literal
>> 
>> Can anyone recommend existing tests I can look to for examples for implementing these tests? What other tests I should have? What other changes to the code are needed?
> 
> @@ -1036,7 +1037,13 @@ void StringLiteralParser::init(const Token
> *StringToks, unsigned NumStringToks){
>         ThisTokEnd -= (ThisTokBuf - Prefix);
> 
>       // Copy the string over
> -      CopyStringFragment(StringRef(ThisTokBuf, ThisTokEnd - ThisTokBuf));
> +      if (CopyStringFragment(StringRef(ThisTokBuf,ThisTokEnd-ThisTokBuf))
> +          && Diags)
> +      {
> +        Diags->Report(FullSourceLoc(StringToks[i].getLocation(), SM),
> +                      diag::err_bad_string_encoding);
> +      }
> 
> You should set hadError = true here.
> 
> Besides that, I don't see any issues with the code.  Please split the
> ConvertUTF changes into a separate patch, though.

Okay, thanks. These changes'll be in the next version of the patch.

> 
> In terms of tests, test/Lexer/string_concat.cpp might be a good
> example to look at for the invalid cases.  For the valid cases, take a
> look at test/CodeGen/string-literal.c .
> 
>> Also while working on my patch I noticed the following warning:
>> 
>>> test.cpp:33:20: warning: character unicode escape sequence too long for its type
>>>     char16_t c[] = u"\U0001F47F";
>>>                    ^
>> 
>> I found that the resulting code behaves as expected (producing the appropriate UTF-16 surrogate pair in the array). Should there really be a warning here?
> 
> No; I'm pretty sure the warning is meant to catch L"\U0001F47F" in
> -fshort-wchar mode.

short-wchar will make wchar two bytes, right? In that case I think it should be handled the same as u"\U0001F47F". Now, the warning (or maybe even an error) makes sense to me in the case of u'\U0001F47F' and L'\U0001F47F' with short-wchar (perhaps that's even what you meant).

Which reminds me that I should probably make sure that the unicode character literals work as well as the string literals.

Thanks,
Seth

> 
> -Eli