[cfe-dev] [PATCH] C++0x unicode string and character literals now with test cases

Seth Cantrell seth.cantrell at gmail.com
Sun Sep 11 17:05:39 PDT 2011


So what needs to be done to get wide string literals represented in the IR as arrays of their element type rather than as byte arrays?

I've looked around a bit and found VisitStringLiteral in CGExprConstant.cpp. The following version of it seems to produce the right IR, and the few simple programs I've run appear to work correctly. Are there more places that need changes, or is this pretty much it?

Also, this quick attempt leaks memory; does LLVM have a smart pointer (or some other facility) I should be using for the temporary buffers?


  llvm::Constant *VisitStringLiteral(StringLiteral *E) {
    assert(!E->getType()->isPointerType() && "Strings are always arrays");

    // This must be a string initializing an array in a static initializer.
    // Don't emit it as the address of the string, emit the string data itself
    // as an inline array.
    if (E->isAscii() || E->isUTF8() || E->isPascal()) {
      return llvm::ConstantArray::get(VMContext,
                                      CGM.GetStringForStringLiteral(E), false);
    } else {
      // Wide, UTF-16, or UTF-32 literal: emit one integer constant per code
      // unit rather than a flat byte array.
      std::vector<llvm::Constant*> Elts;
      llvm::ArrayType *AType = cast<llvm::ArrayType>(ConvertType(E->getType()));
      llvm::Type *ElemTy = AType->getElementType();
      unsigned NumElements = AType->getNumElements();
      // Keep the StringRef itself; going through .str() creates a temporary
      // std::string and would leave data and size dangling.
      llvm::StringRef Str = E->getString();
      char const *data = Str.data();
      size_t size = Str.size();
      
      unsigned CharByteWidth;
      switch(E->getKind()) {
        case StringLiteral::Wide:
          CharByteWidth = CGM.getTarget().getWCharWidth();
          break;
        case StringLiteral::UTF16:
          CharByteWidth = CGM.getTarget().getChar16Width();
          break;
        case StringLiteral::UTF32:
          CharByteWidth = CGM.getTarget().getChar32Width();
          break;
        case StringLiteral::Ascii:
        case StringLiteral::UTF8:
          CharByteWidth = CGM.getTarget().getCharWidth();
          break;
      }
      assert((CharByteWidth & 7) == 0 && "Assumes character size is byte multiple");
      CharByteWidth /= 8;

      // Reinterpret the little-endian internal bytes as code units of the
      // right width. Note: this assumes a little-endian host, matching the
      // current internal representation of wide string literals.
      // FIXME: these temporary buffers are never freed -- hence the
      // smart-pointer question above.
      unsigned short *short_array = 0;
      unsigned int *int_array = 0;
      if (CharByteWidth == 2) {
        short_array = new unsigned short[NumElements];
        memcpy(short_array, data, size);
      } else if (CharByteWidth == 4) {
        int_array = new unsigned int[NumElements];
        memcpy(int_array, data, size);
      }

      // NumElements includes the null terminator, but the actual data doesn't.
      for (unsigned i = 0; i < NumElements - 1; ++i) {
        unsigned value = 0;
        if (CharByteWidth == 2) {
          value = short_array[i];
        } else if (CharByteWidth == 4) {
          value = int_array[i];
        } else {
          assert(false && "char byte width out of domain");
        }
        llvm::Constant *C = llvm::ConstantInt::get(ElemTy, value, false);
        Elts.push_back(C);
      }
      // Add the null terminator.
      llvm::Constant *C = llvm::ConstantInt::get(ElemTy, 0, false);
      Elts.push_back(C);
      
      return llvm::ConstantArray::get(AType, Elts);
    }
  }
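
For example, with this version a simple test case like

  char32_t c[] = U"abc";

gives me something along the lines of

  @c = global [4 x i32] [i32 97, i32 98, i32 99, i32 0], align 4

on my (little-endian) machine, which matches the [4 x i32] form that the braced-initializer version already produces (see the discussion quoted below). I haven't tried this on a big-endian host.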



On Sep 1, 2011, at 12:53 AM, Eli Friedman wrote:

> On Wed, Aug 31, 2011 at 9:14 PM, Seth Cantrell <seth.cantrell at gmail.com> wrote:
>> 
>> On Aug 29, 2011, at 2:50 PM, Eli Friedman wrote:
>> 
>>> On Sun, Aug 28, 2011 at 6:26 PM, Seth Cantrell <seth.cantrell at gmail.com> wrote:
>>>> Here's a set of patches updated as per your comments.
>>> 
>>> @@ -1119,20 +1133,37 @@ void StringLiteralParser::init(const Token *StringToks, unsigned NumStringToks){
>>> 
>>> /// copyStringFragment - This function copies from Start to End into ResultPtr.
>>> /// Performs widening for multi-byte characters.
>>> -void StringLiteralParser::CopyStringFragment(StringRef Fragment) {
>>> +bool StringLiteralParser::CopyStringFragment(StringRef Fragment) {
>>> +  assert(CharByteWidth==1 || CharByteWidth==2 || CharByteWidth==4);
>>> +  ConversionResult result = conversionOK;
>>>   // Copy the character span over.
>>>   if (CharByteWidth == 1) {
>>>     memcpy(ResultPtr, Fragment.data(), Fragment.size());
>>>     ResultPtr += Fragment.size();
>>> -  } else {
>>> -    // Note: our internal rep of wide char tokens is always little-endian.
>>> -    for (StringRef::iterator I=Fragment.begin(), E=Fragment.end(); I!=E; ++I) {
>>> -      *ResultPtr++ = *I;
>>> -      // Add zeros at the end.
>>> -      for (unsigned i = 1, e = CharByteWidth; i != e; ++i)
>>> -        *ResultPtr++ = 0;
>>> -    }
>>> +  } else if (CharByteWidth == 2) {
>>> +    UTF8 const *sourceStart = (UTF8 const *)Fragment.data();
>>> +    // Is ResultPtr properly aligned? are we violating aliasing rules?
>>> +    UTF16 *targetStart = reinterpret_cast<UTF16*>(ResultPtr);
>>> 
>>> Err, oww, sorry about not noticing this earlier...this breaks
>>> big-endian hosts.  At least for now, please use a local buffer and do
>>> a copy which preserves the current representation.  (It would be nice
>>> to fix strings to use native-endian arrays in a subsequent patch, but
>>> that's a bit more involved.)
>> 
>> It appears to me that it's already messed up on big endian platforms. With git revision f84139a1331c6:
>> 
>> Source tmp.cpp:
>> char32_t c[] = U"abc";
>> 
>> Command:
>> Release+Asserts/bin/clang++ -std=c++0x -stdlib=libc++ -c tmp.cpp -arch ppc -emit-llvm -O0 -o - | llvm-dis
>> 
>> Output:
>> ; ModuleID = '<stdin>'
>> target datalayout = "E-p:32:32:32-i1:8:8-i8:8:8-i16:16:16-i32:32:32-i64:64:64-f32:32:32-f64:64:64-v128:128:128-n32"
>> target triple = "powerpc-apple-darwin11.1.0"
>> 
>> @c = global [16 x i8] c"a\00\00\00b\00\00\00c\00\00\00\00\00\00\00", align 4
>> 
>> As you note, the current code always uses a little-endian representation for wide characters, but it appears that this isn't taken into account during codegen in ActOnStringLiteral. I think my new code would actually work around the problem when building for a big-endian target if the host is big-endian (though I haven't actually tested it).
>> 
>> I'll go ahead and change my code to maintain the current little-endian internal representation for now; I may be missing something, and it's probably better not to disturb the behavior until the other issues are also fixed anyway.
> 
> Okay.
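
(For reference, what I have in mind for the endianness-preserving copy is roughly the sketch below -- this is only a sketch, not the exact code in the updated patch, and only the UTF-16 branch of CopyStringFragment is shown; the UTF-32 one is analogous. UTF8, UTF16, ConvertUTF8toUTF16, and strictConversion are the names ConvertUTF.h already provides, and result is the ConversionResult declared at the top of the function. The idea is to convert into a host-native local buffer first and then write the code units out a byte at a time, always little-endian, so the internal representation stays the same regardless of host endianness.)

  } else if (CharByteWidth == 2) {
    UTF8 const *sourceStart = (UTF8 const *)Fragment.data();
    UTF8 const *sourceEnd = sourceStart + Fragment.size();
    // Convert into a temporary, host-native UTF-16 buffer first.
    // (The +1 keeps &buffer[0] valid even for an empty fragment.)
    std::vector<UTF16> buffer(Fragment.size() + 1);
    UTF16 *targetStart = &buffer[0];
    UTF16 *targetEnd = targetStart + Fragment.size();
    result = ConvertUTF8toUTF16(&sourceStart, sourceEnd,
                                &targetStart, targetEnd, strictConversion);
    // targetStart now points one past the last code unit written. Copy the
    // units out little-endian so big-endian hosts keep the same internal
    // representation.
    for (UTF16 *unit = &buffer[0]; unit != targetStart; ++unit) {
      *ResultPtr++ = (char)(*unit & 0xFF);
      *ResultPtr++ = (char)((*unit >> 8) & 0xFF);
    }
  }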
> 
>> I would like to have the internal representation be native-endian arrays as well. Can you point out to me what you think the issues will be?
> 
> I'm not really expecting difficulties, just making sure everything
> uses the correct types to access the string where appropriate.
> Actually, there isn't much that actually examines the contents of the
> string between the string literal parser and IRGen; I guess there's
> the printf checker and a couple of other things...
> 
>> Also I notice that the constant generated for:
>> 
>> char32_t c[] = {U'a',U'b',U'c',U'\0'};
>> 
>> is
>> 
>> @c = global [4 x i32] [i32 97, i32 98, i32 99, i32 0], align 4
>> 
>> It seems to me that this is preferable to the [16 x i8] definition even for string literals. Would this be acceptable? Would it be a large change or one I can make just in ActOnStringLiteral?
> 
> Yes; that's part of the change to use native-endian arrays, I think...
> do everything consistently, and nothing has to worry about endianness
> on the strings until the object file is written.
> 
>>> 
>>> +// RUN: %clang_cc1 -std=c++0x -fsyntax-only -verify %s
>>> +
>>> +void f() {
>>> +    wchar_t const *a = L"�����"; // expected-error {{ illegal sequence in string literal }}
>>> +
>>> +    char16_t const *b = u"�����"; // expected-error {{ illegal sequence in string literal }}
>>> +    char32_t const *c = U"�����"; // expected-error {{ illegal sequence in string literal }}
>>> +    wchar_t const *d = LR"(�����)"; // expected-error {{ illegal sequence in string literal }}
>>> +    char16_t const *e = uR"(�����)"; // expected-error {{ illegal sequence in string literal }}
>>> +    char32_t const *f = UR"(�����)"; // expected-error {{ illegal sequence in string literal }}
>>> +}
>>> \ No newline at end of file
>>> --
>>> 
>>> Please put a newline at the end of the file.  Please, please put in a
>>> comment that people should be very careful when they edit this file,
>>> in case their text editor "fixes" the bad encoding.
>> 
>> Thanks, I'll do that, although this file is consistently ISO-8859-1 and shouldn't cause much of a problem. The other test file, however, mixes ISO-8859-1 and UTF-8, so it could be confusing. Xcode guesses at its encoding and comes up with Mac Roman. (Although it's interesting that gitk shows all the string literals in their correct encoding; apparently it guesses encodings on a line-by-line basis.)
> 
> That's... interesting.
> 
> -Eli




