[cfe-dev] [PATCH] C++0x unicode string and character literals now with test cases

Tue Sep 13 01:32:37 PDT 2011

On Mon, Sep 12, 2011 at 2:34 PM, Seth Cantrell <seth.cantrell at gmail.com> wrote:
>
> On Sep 12, 2011, at 1:48 PM, Eli Friedman wrote:
>
>> On Sun, Sep 11, 2011 at 5:05 PM, Seth Cantrell <seth.cantrell at gmail.com> wrote:
>>> So what all needs to be done to get the IR representation for wide string literals to work as arrays?
>>>
>>> I've looked around a bit and found VisitStringLiteral in CGExprConstant.cpp and the following version of it seems to produce the right IR and the few simple programs I've run have seemed to work correctly. Are there more places that need changes or is this pretty much it?
>>>
>>> Also This quick attempt is leaking memory, does llvm have a smart pointer to use or anything?
>>
>> llvm::OwningPtr.
>
> Ah, thank you.
>
>>
>>>
>>>  llvm::Constant *VisitStringLiteral(StringLiteral *E) {
>>>    assert(!E->getType()->isPointerType() && "Strings are always arrays");
>>>
>>>    // This must be a string initializing an array in a static initializer.
>>>    // Don't emit it as the address of the string, emit the string data itself
>>>    // as an inline array.
>>>    if (E->isAscii() || E->isUTF8() || E->isPascal()) {
>>>      return llvm::ConstantArray::get(VMContext,
>>>                                      CGM.GetStringForStringLiteral(E), false);
>>>    } else {
>>>      std::vector<llvm::Constant*> Elts;
>>>      llvm::ArrayType *AType = cast<llvm::ArrayType>(ConvertType(E->getType()));
>>>      llvm::Type *ElemTy = AType->getElementType();
>>>      unsigned NumElements = AType->getNumElements();
>>>      char const *data = E->getString().str().data();
>>>      size_t size = E->getString().str().size();
>>>
>>>      unsigned CharByteWidth;
>>>      switch(E->getKind()) {
>>>        case StringLiteral::Wide:
>>>          CharByteWidth = CGM.getTarget().getWCharWidth();
>>>          break;
>>>        case StringLiteral::UTF16:
>>>          CharByteWidth = CGM.getTarget().getChar16Width();
>>>          break;
>>>        case StringLiteral::UTF32:
>>>          CharByteWidth = CGM.getTarget().getChar32Width();
>>>          break;
>>>        case StringLiteral::Ascii:
>>>        case StringLiteral::UTF8:
>>>          CharByteWidth = CGM.getTarget().getCharWidth();
>>>          break;
>>>      }
>>>      assert((CharByteWidth & 7) == 0 && "Assumes character size is byte multiple");
>>>      CharByteWidth /= 8;
>>>
>>>      unsigned short *short_array = new unsigned short[NumElements];
>>>      memcpy(short_array,data,size);
>>>      unsigned int *int_array = new unsigned int[NumElements];
>>>      memcpy(int_array,data,size);
>>>
>>>      // NumElements includes null terminator but actual data doesn't
>>>      for(unsigned i=0;i<NumElements-1;++i) {
>>>        unsigned value;
>>>        if (CharByteWidth==2) {
>>>          value = short_array[i];
>>>        } else if (CharByteWidth==4) {
>>>          value = int_array[i];
>>>        } else {
>>>          assert(false && "char byte width out of domain");
>>>        }
>>>        llvm::Constant *C = llvm::ConstantInt::get(ElemTy,value,false);
>>>        Elts.push_back(C);
>>>      }
>>>      // add on the null terminator
>>>      llvm::Constant *C = llvm::ConstantInt::get(ElemTy,0,false);
>>>      Elts.push_back(C);
>>>
>>>      return llvm::ConstantArray::get(AType, Elts);
>>>    }
>>>  }
>>
>> I'm not entirely sure why you think the memcpy is necessary, but this
>> appears to be in the right direction.
>
> When started out with just using reinterpret_cast, e.g. 'value = reinterpret_cast<unsigned const *>(data)[i]', sometimes the values that would come out were garbage. For example for U"4abc" I'd get something like [5 x i32] [i32 52, i32 97, i32 98, i32 73476345, i32 0]. I couldn't figure out why since the byte values were fine. But the pointer might not be properly aligned and using reinterpret_cast this way isn't defined anyway. If you have a suggestion about doing this portably without the memory allocation/copy then I'd like to avoid them.

Err, the strange results are probably the result of the use-after-free
of the variable "data".

That said, you shouldn't need the reinterpret cast or the memcpy; why
not just allocate the buffer in the StringLiteral using the right type
in the first place?

-Eli