[cfe-dev] [PATCH] C++0x unicode string and character literals now with test cases

Mon Sep 12 10:48:16 PDT 2011

On Sun, Sep 11, 2011 at 5:05 PM, Seth Cantrell <seth.cantrell at gmail.com> wrote:
> So what all needs to be done to get the IR representation for wide string literals to work as arrays?
>
> I've looked around a bit and found VisitStringLiteral in CGExprConstant.cpp. The following version of it seems to produce the right IR, and the few simple programs I've run seem to work correctly. Are there more places that need changes, or is this pretty much it?
>
> Also, this quick attempt leaks memory. Does LLVM have a smart pointer I could use?

llvm::OwningPtr.

>
>  llvm::Constant *VisitStringLiteral(StringLiteral *E) {
>    assert(!E->getType()->isPointerType() && "Strings are always arrays");
>
>    // This must be a string initializing an array in a static initializer.
>    // Don't emit it as the address of the string, emit the string data itself
>    // as an inline array.
>    if (E->isAscii() || E->isUTF8() || E->isPascal()) {
>      return llvm::ConstantArray::get(VMContext,
>                                      CGM.GetStringForStringLiteral(E), false);
>    } else {
>      std::vector<llvm::Constant*> Elts;
>      llvm::ArrayType *AType = cast<llvm::ArrayType>(ConvertType(E->getType()));
>      llvm::Type *ElemTy = AType->getElementType();
>      unsigned NumElements = AType->getNumElements();
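>      // getString() returns a StringRef into the literal's own storage, so
>      // take the pointer from it directly; going through .str() would leave
>      // data pointing into a destroyed temporary std::string.
>      char const *data = E->getString().data();
>      size_t size = E->getString().size();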
>
>      unsigned CharByteWidth;
>      switch(E->getKind()) {
>        case StringLiteral::Wide:
>          CharByteWidth = CGM.getTarget().getWCharWidth();
>          break;
>        case StringLiteral::UTF16:
>          CharByteWidth = CGM.getTarget().getChar16Width();
>          break;
>        case StringLiteral::UTF32:
>          CharByteWidth = CGM.getTarget().getChar32Width();
>          break;
>        case StringLiteral::Ascii:
>        case StringLiteral::UTF8:
>          CharByteWidth = CGM.getTarget().getCharWidth();
>          break;
>      }
>      assert((CharByteWidth & 7) == 0 && "Assumes character size is byte multiple");
>      CharByteWidth /= 8;
>
>      unsigned short *short_array = new unsigned short[NumElements];
>      memcpy(short_array,data,size);
>      unsigned int *int_array = new unsigned int[NumElements];
>      memcpy(int_array,data,size);
>
>      // NumElements includes null terminator but actual data doesn't
>      for(unsigned i=0;i<NumElements-1;++i) {
>        unsigned value;
>        if (CharByteWidth==2) {
>          value = short_array[i];
>        } else if (CharByteWidth==4) {
>          value = int_array[i];
>        } else {
>          assert(false && "char byte width out of domain");
>        }
>        llvm::Constant *C = llvm::ConstantInt::get(ElemTy,value,false);
>        Elts.push_back(C);
>      }
>      // add on the null terminator
>      llvm::Constant *C = llvm::ConstantInt::get(ElemTy,0,false);
>      Elts.push_back(C);
>
>      return llvm::ConstantArray::get(AType, Elts);
>    }
>  }

I'm not entirely sure why you think the memcpy is necessary, but this
appears to be heading in the right direction.
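
As a rough illustration of reading the code units without the two
heap-allocated copies, here is a minimal standalone sketch. The helper
name decodeCodeUnits and its parameters are hypothetical and not part of
Clang; it assumes the bytes are in host byte order and only handles
widths of 1, 2 and 4 bytes. It still uses a small fixed-size std::memcpy
per element, purely to avoid unaligned reads; nothing is allocated with
new, so nothing leaks.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <vector>

// Hypothetical helper (not a Clang API): reads code units of
// charByteWidth bytes each from a raw byte buffer in host byte order.
// The result always has numElements entries; positions past the end of
// the data (such as the null terminator) are zero-filled.
std::vector<std::uint32_t> decodeCodeUnits(const char *bytes,
                                           std::size_t numBytes,
                                           unsigned charByteWidth,
                                           std::size_t numElements) {
  assert((charByteWidth == 1 || charByteWidth == 2 || charByteWidth == 4) &&
         "char byte width out of domain");

  std::vector<std::uint32_t> units;
  units.reserve(numElements);

  const std::size_t numUnits = numBytes / charByteWidth;
  for (std::size_t i = 0; i < numUnits && i < numElements; ++i) {
    const char *p = bytes + i * charByteWidth;
    std::uint32_t value = 0;
    if (charByteWidth == 1) {
      value = static_cast<unsigned char>(*p);
    } else if (charByteWidth == 2) {
      std::uint16_t v16 = 0;
      std::memcpy(&v16, p, sizeof(v16)); // alignment-safe 2-byte read
      value = v16;
    } else {
      std::memcpy(&value, p, sizeof(value)); // alignment-safe 4-byte read
    }
    units.push_back(value);
  }

  // Zero-fill the remainder, which covers the null terminator.
  units.resize(numElements, 0);
  return units;
}
```

Each returned value would then be handed to llvm::ConstantInt::get the
same way the loop in the quoted patch does.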

Taking a quick look, I think you're missing a codepath: as far as I can
tell, <int x[] = L"asdf";> and <int *x = L"asdf";> use different codepaths.
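
To make the two forms concrete at the source level, here is a small
sketch. It uses wchar_t rather than int because wchar_t is a distinct
type in C++ (the int spelling above only works in C on targets where
wchar_t is int). The variable and function names are made up for
illustration. The array form is the case the comment in the quoted
VisitStringLiteral describes, where the string data itself becomes the
initializer; the pointer form instead needs the address of a separately
emitted literal.

```cpp
#include <cstddef>

// Array form: the literal's code units, including the null terminator,
// initialize the array's own storage.
wchar_t arrayForm[] = L"asdf";

// Pointer form: the variable holds the address of a string literal that
// lives somewhere else.
const wchar_t *pointerForm = L"asdf";

// Number of elements in the array form (four characters plus terminator).
std::size_t arrayFormElements() {
  return sizeof(arrayForm) / sizeof(arrayForm[0]);
}
```

A test case for the patch would presumably want to cover both forms for
each of the wide, UTF-16 and UTF-32 literal kinds.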

-Eli