[cfe-commits] [cfe-dev] [PATCH] C++0x unicode string and character literals now with test cases

Sun Aug 28 18:26:51 PDT 2011

Here's a set of patches updated as per your comments.

On Aug 24, 2011, at 3:08 PM, Eli Friedman wrote:

> On Tue, Aug 23, 2011 at 9:14 PM, Seth Cantrell <seth.cantrell at gmail.com> wrote:
>> 
>> Attached is a patch which allows UTF-16 and UTF-32 string literals to work as expected (i.e., the source string literal data is converted from the input encoding (currently always UTF-8) to UTF-16 and UTF-32, and can be accessed as such at runtime). The patch only changes how non-escaped data in string literals is handled, hex escape sequence and universal character name handling isn't changed.
>> 
>> Can someone take a look and let me know what needs changing to get this accepted? So far I expect I'll need to add tests:
>> 
>> - test all string types (u8, u8R, u, uR, U, UR, L, LR) with valid UTF-8 data, verify that the output object file contains the expected data
>> - test the new error using an ISO-8859-1 encoded file containing accented characters in string literal
>> 
>> Can anyone recommend existing tests I can look to for examples for implementing these tests? What other tests I should have? What other changes to the code are needed?
> 
> @@ -1036,7 +1037,13 @@ void StringLiteralParser::init(const Token
> *StringToks, unsigned NumStringToks){
>         ThisTokEnd -= (ThisTokBuf - Prefix);
> 
>       // Copy the string over
> -      CopyStringFragment(StringRef(ThisTokBuf, ThisTokEnd - ThisTokBuf));
> +      if (CopyStringFragment(StringRef(ThisTokBuf,ThisTokEnd-ThisTokBuf))
> +          && Diags)
> +      {
> +        Diags->Report(FullSourceLoc(StringToks[i].getLocation(), SM),
> +                      diag::err_bad_string_encoding);
> +      }
> 
> You should set hadError = true here.
> 
> Besides that, I don't see any issues with the code.  Please split the
> ConvertUTF changes into a separate patch, though.
> 
> In terms of tests, test/Lexer/string_concat.cpp might be a good
> example to look at for the invalid cases.  For the valid cases, take a
> look at test/CodeGen/string-literal.c .
> 
>> Also while working on my patch I noticed the following warning:
>> 
>>> test.cpp:33:20: warning: character unicode escape sequence too long for its type
>>>     char16_t c[] = u"\U0001F47F";
>>>                    ^
>> 
>> I found that the resulting code behaves as expected (producing the appropriate UTF-16 surrogate pair in the array). Should there really be a warning here?
> 
> No; I'm pretty sure the warning is meant to catch L"\U0001F47F" in
> -fshort-wchar mode.
> 
> -Eli
-------------- next part --------------
A non-text attachment was scrubbed...
Name: 0001-enable-use-of-ConvertUTF8toUTF32.patch
Type: application/octet-stream
Size: 7147 bytes
Desc: not available
URL: <http://lists.llvm.org/pipermail/cfe-commits/attachments/20110828/f9d98baf/attachment.obj>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: 0002-Convert-string-literal-data-from-source-to-target-en.patch
Type: application/octet-stream
Size: 7377 bytes
Desc: not available
URL: <http://lists.llvm.org/pipermail/cfe-commits/attachments/20110828/f9d98baf/attachment-0001.obj>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: 0003-create-tests-for-unicode-string-literal-conversions.patch
Type: application/octet-stream
Size: 4257 bytes
Desc: not available
URL: <http://lists.llvm.org/pipermail/cfe-commits/attachments/20110828/f9d98baf/attachment-0002.obj>