[cfe-commits] [cfe-dev] [PATCH] C++0x unicode string and character literals now with test cases

Mon Aug 29 11:50:24 PDT 2011

On Sun, Aug 28, 2011 at 6:26 PM, Seth Cantrell <seth.cantrell at gmail.com> wrote:
> Here's a set of patches updated as per your comments.

@@ -1119,20 +1133,37 @@ void StringLiteralParser::init(const Token
*StringToks, unsigned NumStringToks){

 /// copyStringFragment - This function copies from Start to End into ResultPtr.
 /// Performs widening for multi-byte characters.
-void StringLiteralParser::CopyStringFragment(StringRef Fragment) {
+bool StringLiteralParser::CopyStringFragment(StringRef Fragment) {
+  assert(CharByteWidth==1 || CharByteWidth==2 || CharByteWidth==4);
+  ConversionResult result = conversionOK;
   // Copy the character span over.
   if (CharByteWidth == 1) {
     memcpy(ResultPtr, Fragment.data(), Fragment.size());
     ResultPtr += Fragment.size();
-  } else {
-    // Note: our internal rep of wide char tokens is always little-endian.
-    for (StringRef::iterator I=Fragment.begin(), E=Fragment.end(); I!=E; ++I) {
-      *ResultPtr++ = *I;
-      // Add zeros at the end.
-      for (unsigned i = 1, e = CharByteWidth; i != e; ++i)
-        *ResultPtr++ = 0;
-    }
+  } else if (CharByteWidth == 2) {
+    UTF8 const *sourceStart = (UTF8 const *)Fragment.data();
+    // Is ResultPtr properly aligned? are we violating aliasing rules?
+    UTF16 *targetStart = reinterpret_cast<UTF16*>(ResultPtr);

Err, oww, sorry about not noticing this earlier...this breaks
big-endian hosts.  At least for now, please use a local buffer and do
a copy which preserves the current representation.  (It would be nice
to fix strings to use native-endian arrays in a subsequent patch, but
that's a bit more involved.)

+// RUN: %clang_cc1 -std=c++0x -fsyntax-only -verify %s
+
+void f() {
+    wchar_t const *a = L"�����"; // expected-error {{ illegal
sequence in string literal }}
+
+    char16_t const *b = u"�����"; // expected-error {{ illegal
sequence in string literal }}
+    char32_t const *c = U"�����"; // expected-error {{ illegal
sequence in string literal }}
+    wchar_t const *d = LR"(�����)"; // expected-error {{ illegal
sequence in string literal }}
+    char16_t const *e = uR"(�����)"; // expected-error {{ illegal
sequence in string literal }}
+    char32_t const *f = UR"(�����)"; // expected-error {{ illegal
sequence in string literal }}
+}
\ No newline at end of file
-- 

Please put a newline at the end of the file.  Please, please put in a
comment that people should be very careful when they edit this file,
in case their text editor "fixes" the bad encoding.

-Eli

> On Aug 24, 2011, at 3:08 PM, Eli Friedman wrote:
>
>> On Tue, Aug 23, 2011 at 9:14 PM, Seth Cantrell <seth.cantrell at gmail.com> wrote:
>>>
>>> Attached is a patch which allows UTF-16 and UTF-32 string literals to work as expected (i.e., the source string literal data is converted from the input encoding (currently always UTF-8) to UTF-16 and UTF-32, and can be accessed as such at runtime). The patch only changes how non-escaped data in string literals is handled, hex escape sequence and universal character name handling isn't changed.
>>>
>>> Can someone take a look and let me know what needs changing to get this accepted? So far I expect I'll need to add tests:
>>>
>>> - test all string types (u8, u8R, u, uR, U, UR, L, LR) with valid UTF-8 data, verify that the output object file contains the expected data
>>> - test the new error using an ISO-8859-1 encoded file containing accented characters in string literal
>>>
>>> Can anyone recommend existing tests I can look to for examples for implementing these tests? What other tests I should have? What other changes to the code are needed?
>>
>> @@ -1036,7 +1037,13 @@ void StringLiteralParser::init(const Token
>> *StringToks, unsigned NumStringToks){
>>         ThisTokEnd -= (ThisTokBuf - Prefix);
>>
>>       // Copy the string over
>> -      CopyStringFragment(StringRef(ThisTokBuf, ThisTokEnd - ThisTokBuf));
>> +      if (CopyStringFragment(StringRef(ThisTokBuf,ThisTokEnd-ThisTokBuf))
>> +          && Diags)
>> +      {
>> +        Diags->Report(FullSourceLoc(StringToks[i].getLocation(), SM),
>> +                      diag::err_bad_string_encoding);
>> +      }
>>
>> You should set hadError = true here.
>>
>> Besides that, I don't see any issues with the code.  Please split the
>> ConvertUTF changes into a separate patch, though.
>>
>> In terms of tests, test/Lexer/string_concat.cpp might be a good
>> example to look at for the invalid cases.  For the valid cases, take a
>> look at test/CodeGen/string-literal.c .
>>
>>> Also while working on my patch I noticed the following warning:
>>>
>>>> test.cpp:33:20: warning: character unicode escape sequence too long for its type
>>>>     char16_t c[] = u"\U0001F47F";
>>>>                    ^
>>>
>>> I found that the resulting code behaves as expected (producing the appropriate UTF-16 surrogate pair in the array). Should there really be a warning here?
>>
>> No; I'm pretty sure the warning is meant to catch L"\U0001F47F" in
>> -fshort-wchar mode.
>>
>> -Eli
>