[cfe-commits] [cfe-dev] [PATCH] C++0x unicode string and character literals now with test cases
Eli Friedman
eli.friedman at gmail.com
Mon Aug 29 11:50:24 PDT 2011
On Sun, Aug 28, 2011 at 6:26 PM, Seth Cantrell <seth.cantrell at gmail.com> wrote:
> Here's a set of patches updated as per your comments.
@@ -1119,20 +1133,37 @@ void StringLiteralParser::init(const Token
*StringToks, unsigned NumStringToks){
/// copyStringFragment - This function copies from Start to End into ResultPtr.
/// Performs widening for multi-byte characters.
-void StringLiteralParser::CopyStringFragment(StringRef Fragment) {
+bool StringLiteralParser::CopyStringFragment(StringRef Fragment) {
+ assert(CharByteWidth==1 || CharByteWidth==2 || CharByteWidth==4);
+ ConversionResult result = conversionOK;
// Copy the character span over.
if (CharByteWidth == 1) {
memcpy(ResultPtr, Fragment.data(), Fragment.size());
ResultPtr += Fragment.size();
- } else {
- // Note: our internal rep of wide char tokens is always little-endian.
- for (StringRef::iterator I=Fragment.begin(), E=Fragment.end(); I!=E; ++I) {
- *ResultPtr++ = *I;
- // Add zeros at the end.
- for (unsigned i = 1, e = CharByteWidth; i != e; ++i)
- *ResultPtr++ = 0;
- }
+ } else if (CharByteWidth == 2) {
+ UTF8 const *sourceStart = (UTF8 const *)Fragment.data();
+ // Is ResultPtr properly aligned? are we violating aliasing rules?
+ UTF16 *targetStart = reinterpret_cast<UTF16*>(ResultPtr);
Err, oww, sorry about not noticing this earlier...this breaks
big-endian hosts. At least for now, please use a local buffer and do
a copy which preserves the current representation. (It would be nice
to fix strings to use native-endian arrays in a subsequent patch, but
that's a bit more involved.)
+// RUN: %clang_cc1 -std=c++0x -fsyntax-only -verify %s
+
+void f() {
+ wchar_t const *a = L"�����"; // expected-error {{ illegal
sequence in string literal }}
+
+ char16_t const *b = u"�����"; // expected-error {{ illegal
sequence in string literal }}
+ char32_t const *c = U"�����"; // expected-error {{ illegal
sequence in string literal }}
+ wchar_t const *d = LR"(�����)"; // expected-error {{ illegal
sequence in string literal }}
+ char16_t const *e = uR"(�����)"; // expected-error {{ illegal
sequence in string literal }}
+ char32_t const *f = UR"(�����)"; // expected-error {{ illegal
sequence in string literal }}
+}
\ No newline at end of file
--
Please put a newline at the end of the file. Please, please put in a
comment that people should be very careful when they edit this file,
in case their text editor "fixes" the bad encoding.
-Eli
> On Aug 24, 2011, at 3:08 PM, Eli Friedman wrote:
>
>> On Tue, Aug 23, 2011 at 9:14 PM, Seth Cantrell <seth.cantrell at gmail.com> wrote:
>>>
>>> Attached is a patch which allows UTF-16 and UTF-32 string literals to work as expected (i.e., the source string literal data is converted from the input encoding (currently always UTF-8) to UTF-16 and UTF-32, and can be accessed as such at runtime). The patch only changes how non-escaped data in string literals is handled, hex escape sequence and universal character name handling isn't changed.
>>>
>>> Can someone take a look and let me know what needs changing to get this accepted? So far I expect I'll need to add tests:
>>>
>>> - test all string types (u8, u8R, u, uR, U, UR, L, LR) with valid UTF-8 data, verify that the output object file contains the expected data
>>> - test the new error using an ISO-8859-1 encoded file containing accented characters in string literal
>>>
>>> Can anyone recommend existing tests I can look to for examples for implementing these tests? What other tests I should have? What other changes to the code are needed?
>>
>> @@ -1036,7 +1037,13 @@ void StringLiteralParser::init(const Token
>> *StringToks, unsigned NumStringToks){
>> ThisTokEnd -= (ThisTokBuf - Prefix);
>>
>> // Copy the string over
>> - CopyStringFragment(StringRef(ThisTokBuf, ThisTokEnd - ThisTokBuf));
>> + if (CopyStringFragment(StringRef(ThisTokBuf,ThisTokEnd-ThisTokBuf))
>> + && Diags)
>> + {
>> + Diags->Report(FullSourceLoc(StringToks[i].getLocation(), SM),
>> + diag::err_bad_string_encoding);
>> + }
>>
>> You should set hadError = true here.
>>
>> Besides that, I don't see any issues with the code. Please split the
>> ConvertUTF changes into a separate patch, though.
>>
>> In terms of tests, test/Lexer/string_concat.cpp might be a good
>> example to look at for the invalid cases. For the valid cases, take a
>> look at test/CodeGen/string-literal.c .
>>
>>> Also while working on my patch I noticed the following warning:
>>>
>>>> test.cpp:33:20: warning: character unicode escape sequence too long for its type
>>>> char16_t c[] = u"\U0001F47F";
>>>> ^
>>>
>>> I found that the resulting code behaves as expected (producing the appropriate UTF-16 surrogate pair in the array). Should there really be a warning here?
>>
>> No; I'm pretty sure the warning is meant to catch L"\U0001F47F" in
>> -fshort-wchar mode.
>>
>> -Eli
>
More information about the cfe-commits
mailing list