[cfe-dev] [PATCH] C++0x unicode string and character literals now with test cases

Seth Cantrell seth.cantrell at gmail.com
Mon Oct 24 18:43:56 PDT 2011


On Oct 24, 2011, at 1:14 AM, Eli Friedman wrote:

> On Sun, Oct 23, 2011 at 8:51 PM, Seth Cantrell <seth.cantrell at gmail.com> wrote:
>> Here's a new set of patches for review. In addition to the previous changes in string literal encoding, this set changes the way non-char-based strings are represented in LLVM IR. Non-char-based strings are now natively arrays of i16 or i32, depending on the size of the character element used in the string constant.
>> 
>> 
>> 
>> 
>> As per the previous comments, I fixed the newline at the end of the new test files and added comments on the test files' encodings. Patch 3 is the one dealing with changing the representation of string constants to use arrays with elements of the right size for wchar_t, char16_t, and char32_t strings. Patch 5 modifies preexisting tests that expect the old representation.
>> 
>> There's still one test failure that I'm not sure about. CodeGen/global-init.c expects a line like:
>> 
>> @l = global { [24 x i8], i32 } { [24 x i8] c"f\00\00\00o\00\00\00o\00\00\00\00\00\00\00\00\00\00\00\00\00\00\00", i32 1 }
>> 
>> and building with these patches produces a different result:
>> 
>> %struct.K = type { [6 x i32], i32 }
>> ...
>> @l = global %struct.K { [6 x i32] [i32 102, i32 111, i32 111, i32 0, i32 65, i32 0], i32 1 }, align 4
>> 
>> This includes a change other than just the string representation. Now the struct type K is created and referenced instead of being defined inline. I don't think this is related to any changes I made, but I don't want to just update the test with the output I get without being certain that it's correct.
> 
> That's fine: the reason you're seeing the change is that where
> possible, we try to generate globals with an appropriate type.  Given
> the old representation, that was not possible for this case; with the
> new representation, it is.
> 
>> Let me know what you think.
> 
> diff --git a/lib/Serialization/ASTReaderStmt.cpp
> b/lib/Serialization/ASTReaderStmt.cpp
> index 85d0f92..aeb40fb 100644
> --- a/lib/Serialization/ASTReaderStmt.cpp
> +++ b/lib/Serialization/ASTReaderStmt.cpp
> @@ -372,12 +372,13 @@ void ASTStmtReader::VisitStringLiteral(StringLiteral *E) {
>   assert(Record[Idx] == E->getNumConcatenated() &&
>          "Wrong number of concatenated tokens!");
>   ++Idx;
> -  E->Kind = static_cast<StringLiteral::StringKind>(Record[Idx++]);
> -  E->IsPascal = Record[Idx++];
> +  StringLiteral::StringKind kind =
> +        static_cast<StringLiteral::StringKind>(Record[Idx++]);
> +  bool isPascal = Record[Idx++];
> 
>   // Read string data
>   llvm::SmallString<16> Str(&Record[Idx], &Record[Idx] + Len);
> -  E->setString(Reader.getContext(), Str.str());
> +  E->setString(Reader.getContext(), Str.str(),kind, isPascal);
>   Idx += Len;
> 
> Whitespace; also, you might want to add a FIXME about how we read in
> the string.  (Ideally, the format of serialized AST files should not
> vary across platforms.)

I'm not sure that's possible in this case.

Consider the literals L"\xD83D\xDC7F" and L"\U0001F47F" (the former being the UTF-16 surrogate pair corresponding to the latter). For a Windows target these two literals are indistinguishable after we do the initial translation in Sema::ActOnStringLiteral: the resulting StringLiteral will store the same data and have the same Kind and CharByteWidth. But on a platform with sizeof(wchar_t)==4 the two will not be the same.

Another example is the literal L"\x0010FFFD": there is no valid representation when sizeof(wchar_t)==2, but there is for a 4-byte wchar_t. On Windows this literal would trigger an error.

So I don't believe it is possible to create the correct representation of the string literal for any given platform using only the representation we get from some other platform.
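
As a concrete illustration of the first example, here's a tiny stand-alone program (not part of the patch); the counts it prints depend on sizeof(wchar_t) for the platform it's built for:

  #include <cstdio>

  int main() {
    // With a 2-byte wchar_t (Windows), both literals end up as the two code
    // units D83D DC7F, so the stored StringLiteral data is identical.
    // With a 4-byte wchar_t, the first stays as two code units while the
    // second becomes the single code unit 0001F47F.
    const wchar_t a[] = L"\xD83D\xDC7F";
    const wchar_t b[] = L"\U0001F47F";
    std::printf("sizeof(wchar_t)=%u: a has %u code units, b has %u\n",
                (unsigned)sizeof(wchar_t),
                (unsigned)(sizeof(a) / sizeof(wchar_t) - 1),
                (unsigned)(sizeof(b) / sizeof(wchar_t) - 1));
  }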

My current code just uses the serialized Kind and the target info for the current context to determine the CharByteWidth, and then reads the serialized string data in units of that size. It works correctly iff the targets share the same CharByteWidth for the kind. If they don't, you'll either get a string of the wrong length or you'll hit an assertion I put in.
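
To restate that assumption in code (made-up helper name and parameters, not the actual patch text):

  #include <cassert>

  // Correct iff the producing target had the same CharByteWidth for this
  // Kind; otherwise the computed length is wrong or the assertion fires.
  unsigned readLength(unsigned SerializedByteLength, unsigned CharByteWidth) {
    assert(SerializedByteLength % CharByteWidth == 0 &&
           "string data from the AST file doesn't match this target's CharByteWidth");
    return SerializedByteLength / CharByteWidth;
  }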

So I've added a FIXME in setString about the assumption that the input data is compatible with the current target, but otherwise I just read the data directly with no attempt at translation. There are some things that could be done to get somewhat better behavior when the targets aren't compatible, but none that I can think of that are really correct using the currently available info.

I think the information needed to handle everything correctly is a flag for each code unit in the StringLiteral indicating whether or not it came from a hex escape sequence, plus the source encoding. Then we could take runs of code units that did not come from hex escapes and translate them to the target encoding, literally copy over the units that did come from hex escapes, and produce an error or warning when a unit from a hex escape is too large for the new target.
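
Roughly this shape, purely as a sketch of the idea (these names don't exist anywhere in the patch):

  #include <cstdint>

  // Serialize the source encoding plus, for each code unit, whether it came
  // from a hex/octal escape.
  struct SerializedCodeUnit {
    uint32_t Value;      // code unit in the producing target's encoding
    bool FromHexEscape;  // if set, copy the value over verbatim
  };
  // When deserializing for a different target: translate runs of non-escape
  // units from the source encoding to the new target's encoding, copy escape
  // units as-is, and warn or error if an escape value doesn't fit in the new
  // target's code unit size.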

> 
> @@ -2022,14 +2054,24 @@
> CodeGenModule::GetAddrOfConstantStringFromLiteral(const StringLiteral
> *S) {
>   // FIXME: This can be more efficient.
>   // FIXME: We shouldn't need to bitcast the constant in the wide string case.
>   CharUnits Align = getContext().getTypeAlignInChars(S->getType());
> -  llvm::Constant *C = GetAddrOfConstantString(GetStringForStringLiteral(S),
> -                                              /* GlobalName */ 0,
> -                                              Align.getQuantity());
> -  if (S->isWide() || S->isUTF16() || S->isUTF32()) {
> -    llvm::Type *DestTy =
> -        llvm::PointerType::getUnqual(getTypes().ConvertType(S->getType()));
> -    C = llvm::ConstantExpr::getBitCast(C, DestTy);
> +  if (S->isAscii() || S->isUTF8()) {
> +    return GetAddrOfConstantString(GetStringForStringLiteral(S),
> +                                   /* GlobalName */ 0,
> +                                   Align.getQuantity());
>   }
> 
> Could you add an assertion to GetStringForStringLiteral that we only
> call it for non-wide strings?

Okay. I've added that and removed the code in GetStringForStringLiteral that was handling wide strings.
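
For reference, the assertion is along these lines (paraphrased, not the exact patch text):

  assert(!E->isWide() && !E->isUTF16() && !E->isUTF32() &&
         "GetStringForStringLiteral only handles narrow string literals");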

> 
> +  llvm::Constant *C = GetConstantArrayFromStringLiteral(S);
> +  llvm::GlobalVariable *GV =
> +    new llvm::GlobalVariable(getModule(),C->getType(),
> +                             !Features.WritableStrings,
> +                             llvm::GlobalValue::PrivateLinkage,
> +                             C,".str");
> +  GV->setAlignment(Align.getQuantity());
> +  GV->setUnnamedAddr(true);
> +
> +  llvm::Type *DestTy =
> +    llvm::PointerType::getUnqual(getTypes().ConvertType(S->getType()));
> +  C = llvm::ConstantExpr::getBitCast(GV, DestTy);
>   return C;
> 
> The bitcast shouldn't be necessary with your changes.
> 
> Also, it looks like this causes us to not memoize wide strings; that's
> okay for the moment, but please add a FIXME.

Yeah. Fixme added.

> 
> +llvm::Constant *
> +CodeGenModule::GetConstantArrayFromStringLiteral(const StringLiteral *E) {
> +  assert(!E->getType()->isPointerType() && "Strings are always arrays");
> +
> +  unsigned CharByteWidth = StringLiteral::mapCharByteWidth(getTarget(),
> +                                                           E->getKind());
> 
> Please add a getCharByteWidth() accessor to StringLiteral, and make
> mapCharByteWidth private.

Done.
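
Concretely, it's just a trivial accessor along these lines (the exact form in the patch may differ):

  unsigned getCharByteWidth() const { return CharByteWidth; }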

> 
>   StringRef getString() const {
> -    return StringRef(StrData, ByteLength);
> +    assert((CharByteWidth==1 || CharByteWidth==2 || CharByteWidth==4)
> +           && "unsupported CharByteWidth");
> +    if (CharByteWidth==4) {
> +      return StringRef(reinterpret_cast<const char*>(StrData.asUInt32),
> +                       getByteLength());
> +    } else if (CharByteWidth==2) {
> +      return StringRef(reinterpret_cast<const char*>(StrData.asUInt16),
> +                       getByteLength());
> +    } else {
> +      return StringRef(StrData.asChar, getByteLength());
> +    }
> +  }
> 
> Returning a StringRef for something that isn't using the right
> representation seems like a bad idea; maybe this should just assert
> that CharByteWidth==1, or something like that.

Well, this is currently used in ASTStmtWriter::VisitStringLiteral() to serialize the StringLiteral. Also setString() takes a StringRef.

> 
>   bool containsNonAsciiOrNull() const {
> +    if(!isAscii() && !isUTF8()) {
> +      return true;
> +    }
>     StringRef Str = getString();
>     for (unsigned i = 0, e = Str.size(); i != e; ++i)
>       if (!isascii(Str[i]) || !Str[i])
>         return true;
>     return false;
>   }
> 
> What is the point of this change?

That can probably be taken out. The implementation using getString() is only correct for char strings, but now that I've added StringLiteral::getCodeUnit() a better implementation would be:

  bool containsNonAsciiOrNull() const {
    for (unsigned i = 0, e = Length; i != e; ++i)
      if (!isascii(getCodeUnit(i)) || !getCodeUnit(i))
        return true;
    return false;
  }

Although that's assuming people use this function to check the code unit values in the string, rather than to check whether the byte representation contains internal null bytes or bytes outside the ASCII range.

Looking at the callers, I see that this function is used to determine whether the string contains UTF-8 data and, if so, to convert that data to UTF-16. That doesn't seem right now that StringLiteral holds wide strings. I guess I should devise some test cases that trigger this bug on trunk and file a bug report?

> 
> 
> The rest of the patches seem fine.
> 
> -Eli

Thanks!

No new patch attached until I figure out the serialization stuff. Suggestions welcome.

- Seth


