[cfe-dev] [PATCH] C++0x unicode string and character literals now with test cases

Sun Oct 23 22:14:11 PDT 2011

On Sun, Oct 23, 2011 at 8:51 PM, Seth Cantrell <seth.cantrell at gmail.com> wrote:
> Here's a new set of patches for review. In addition to the previous changes in string literal encoding, this set changes the way non-char based strings are represented in LLVM IR. Now non-char based strings are natively arrays of i32 and i16 based on the size of the character element used in the string constant.
>
>
>
>
> As per the previous comments, I fixed the newline at the end of the new test files and added comments on the test files' encodings. Patch 3 is the one dealing with changing the representation of string constants to use arrays with elements of the right size for wchar_t, char16_t, and char32_t strings. Patch 5 modifies preexisting tests that expect the old representation.
>
> There's still one test failure that I'm not sure about. CodeGen/global-init.c expects a line like:
>
> @l = global { [24 x i8], i32 } { [24 x i8] c"f\00\00\00o\00\00\00o\00\00\00\00\00\00\00\00\00\00\00\00\00\00\00", i32 1 }
>
> and building with these patches produce a different result:
>
> %struct.K = type { [6 x i32], i32 }
> ...
> @l = global %struct.K { [6 x i32] [i32 102, i32 111, i32 111, i32 0, i32 65, i32 0], i32 1 }, align 4
>
> This includes a change other than just the string representation. Now the struct type K is created and referenced instead of being defined inline. I don't think this is related to any changes I made, but I don't what to just update the test with the output I get without being certain that it's correct.

That's fine: the reason you're seeing the change is that where
possible, we try to generate globals with an appropriate type.  Given
the old representation, that was not possible for this case; with the
new representation, it is.

> Let me know what you think.

diff --git a/lib/Serialization/ASTReaderStmt.cpp
b/lib/Serialization/ASTReaderStmt.cpp
index 85d0f92..aeb40fb 100644
--- a/lib/Serialization/ASTReaderStmt.cpp
+++ b/lib/Serialization/ASTReaderStmt.cpp
@@ -372,12 +372,13 @@ void ASTStmtReader::VisitStringLiteral(StringLiteral *E) {
   assert(Record[Idx] == E->getNumConcatenated() &&
          "Wrong number of concatenated tokens!");
   ++Idx;
-  E->Kind = static_cast<StringLiteral::StringKind>(Record[Idx++]);
-  E->IsPascal = Record[Idx++];
+  StringLiteral::StringKind kind =
+        static_cast<StringLiteral::StringKind>(Record[Idx++]);
+  bool isPascal = Record[Idx++];

   // Read string data
   llvm::SmallString<16> Str(&Record[Idx], &Record[Idx] + Len);
-  E->setString(Reader.getContext(), Str.str());
+  E->setString(Reader.getContext(), Str.str(),kind, isPascal);
   Idx += Len;

Whitespace; also, you might want to add a FIXME about how we read in
the string.  (Ideally, the format of serialized AST files should not
vary across platforms.)

@@ -2022,14 +2054,24 @@
CodeGenModule::GetAddrOfConstantStringFromLiteral(const StringLiteral
*S) {
   // FIXME: This can be more efficient.
   // FIXME: We shouldn't need to bitcast the constant in the wide string case.
   CharUnits Align = getContext().getTypeAlignInChars(S->getType());
-  llvm::Constant *C = GetAddrOfConstantString(GetStringForStringLiteral(S),
-                                              /* GlobalName */ 0,
-                                              Align.getQuantity());
-  if (S->isWide() || S->isUTF16() || S->isUTF32()) {
-    llvm::Type *DestTy =
-        llvm::PointerType::getUnqual(getTypes().ConvertType(S->getType()));
-    C = llvm::ConstantExpr::getBitCast(C, DestTy);
+  if (S->isAscii() || S->isUTF8()) {
+    return GetAddrOfConstantString(GetStringForStringLiteral(S),
+                                   /* GlobalName */ 0,
+                                   Align.getQuantity());
   }

Could you add an assertion to GetStringForStringLiteral that we only
call it for non-wide strings?

+  llvm::Constant *C = GetConstantArrayFromStringLiteral(S);
+  llvm::GlobalVariable *GV =
+    new llvm::GlobalVariable(getModule(),C->getType(),
+                             !Features.WritableStrings,
+                             llvm::GlobalValue::PrivateLinkage,
+                             C,".str");
+  GV->setAlignment(Align.getQuantity());
+  GV->setUnnamedAddr(true);
+
+  llvm::Type *DestTy =
+    llvm::PointerType::getUnqual(getTypes().ConvertType(S->getType()));
+  C = llvm::ConstantExpr::getBitCast(GV, DestTy);
   return C;

The bitcast shouldn't be necessary with your changes.

Also, it looks like this causes us to not memoize wide strings; that's
okay for the moment, but please add a FIXME.

+llvm::Constant *
+CodeGenModule::GetConstantArrayFromStringLiteral(const StringLiteral *E) {
+  assert(!E->getType()->isPointerType() && "Strings are always arrays");
+
+  unsigned CharByteWidth = StringLiteral::mapCharByteWidth(getTarget(),
+                                                           E->getKind());

Please add a getCharByteWidth() accessor to StringLiteral, and make
mapCharByteWidth private.

   StringRef getString() const {
-    return StringRef(StrData, ByteLength);
+    assert((CharByteWidth==1 || CharByteWidth==2 || CharByteWidth==4)
+           && "unsupported CharByteWidth");
+    if (CharByteWidth==4) {
+      return StringRef(reinterpret_cast<const char*>(StrData.asUInt32),
+                       getByteLength());
+    } else if (CharByteWidth==2) {
+      return StringRef(reinterpret_cast<const char*>(StrData.asUInt16),
+                       getByteLength());
+    } else {
+      return StringRef(StrData.asChar, getByteLength());
+    }
+  }

Returning a StringRef for something that isn't using the right
representation seems like a bad idea; maybe this should just assert
that CharByteWidth==1, or something like that.

   bool containsNonAsciiOrNull() const {
+    if(!isAscii() && !isUTF8()) {
+      return true;
+    }
     StringRef Str = getString();
     for (unsigned i = 0, e = Str.size(); i != e; ++i)
       if (!isascii(Str[i]) || !Str[i])
         return true;
     return false;
   }

What is the point of this change?


The rest of the patches seem fine.

-Eli