[cfe-dev] [PATCH] C++0x unicode string and character literals now with test cases
Eli Friedman
eli.friedman at gmail.com
Sun Oct 23 22:14:11 PDT 2011
On Sun, Oct 23, 2011 at 8:51 PM, Seth Cantrell <seth.cantrell at gmail.com> wrote:
> Here's a new set of patches for review. In addition to the previous changes in string literal encoding, this set changes the way non-char based strings are represented in LLVM IR. Now non-char based strings are natively arrays of i32 and i16 based on the size of the character element used in the string constant.
>
>
>
>
> As per the previous comments, I fixed the newline at the end of the new test files and added comments on the test files' encodings. Patch 3 is the one dealing with changing the representation of string constants to use arrays with elements of the right size for wchar_t, char16_t, and char32_t strings. Patch 5 modifies preexisting tests that expect the old representation.
>
> There's still one test failure that I'm not sure about. CodeGen/global-init.c expects a line like:
>
> @l = global { [24 x i8], i32 } { [24 x i8] c"f\00\00\00o\00\00\00o\00\00\00\00\00\00\00\00\00\00\00\00\00\00\00", i32 1 }
>
> and building with these patches produce a different result:
>
> %struct.K = type { [6 x i32], i32 }
> ...
> @l = global %struct.K { [6 x i32] [i32 102, i32 111, i32 111, i32 0, i32 65, i32 0], i32 1 }, align 4
>
> This includes a change other than just the string representation. Now the struct type K is created and referenced instead of being defined inline. I don't think this is related to any changes I made, but I don't what to just update the test with the output I get without being certain that it's correct.
That's fine: the reason you're seeing the change is that where
possible, we try to generate globals with an appropriate type. Given
the old representation, that was not possible for this case; with the
new representation, it is.
> Let me know what you think.
diff --git a/lib/Serialization/ASTReaderStmt.cpp
b/lib/Serialization/ASTReaderStmt.cpp
index 85d0f92..aeb40fb 100644
--- a/lib/Serialization/ASTReaderStmt.cpp
+++ b/lib/Serialization/ASTReaderStmt.cpp
@@ -372,12 +372,13 @@ void ASTStmtReader::VisitStringLiteral(StringLiteral *E) {
assert(Record[Idx] == E->getNumConcatenated() &&
"Wrong number of concatenated tokens!");
++Idx;
- E->Kind = static_cast<StringLiteral::StringKind>(Record[Idx++]);
- E->IsPascal = Record[Idx++];
+ StringLiteral::StringKind kind =
+ static_cast<StringLiteral::StringKind>(Record[Idx++]);
+ bool isPascal = Record[Idx++];
// Read string data
llvm::SmallString<16> Str(&Record[Idx], &Record[Idx] + Len);
- E->setString(Reader.getContext(), Str.str());
+ E->setString(Reader.getContext(), Str.str(),kind, isPascal);
Idx += Len;
Whitespace; also, you might want to add a FIXME about how we read in
the string. (Ideally, the format of serialized AST files should not
vary across platforms.)
@@ -2022,14 +2054,24 @@
CodeGenModule::GetAddrOfConstantStringFromLiteral(const StringLiteral
*S) {
// FIXME: This can be more efficient.
// FIXME: We shouldn't need to bitcast the constant in the wide string case.
CharUnits Align = getContext().getTypeAlignInChars(S->getType());
- llvm::Constant *C = GetAddrOfConstantString(GetStringForStringLiteral(S),
- /* GlobalName */ 0,
- Align.getQuantity());
- if (S->isWide() || S->isUTF16() || S->isUTF32()) {
- llvm::Type *DestTy =
- llvm::PointerType::getUnqual(getTypes().ConvertType(S->getType()));
- C = llvm::ConstantExpr::getBitCast(C, DestTy);
+ if (S->isAscii() || S->isUTF8()) {
+ return GetAddrOfConstantString(GetStringForStringLiteral(S),
+ /* GlobalName */ 0,
+ Align.getQuantity());
}
Could you add an assertion to GetStringForStringLiteral that we only
call it for non-wide strings?
+ llvm::Constant *C = GetConstantArrayFromStringLiteral(S);
+ llvm::GlobalVariable *GV =
+ new llvm::GlobalVariable(getModule(),C->getType(),
+ !Features.WritableStrings,
+ llvm::GlobalValue::PrivateLinkage,
+ C,".str");
+ GV->setAlignment(Align.getQuantity());
+ GV->setUnnamedAddr(true);
+
+ llvm::Type *DestTy =
+ llvm::PointerType::getUnqual(getTypes().ConvertType(S->getType()));
+ C = llvm::ConstantExpr::getBitCast(GV, DestTy);
return C;
The bitcast shouldn't be necessary with your changes.
Also, it looks like this causes us to not memoize wide strings; that's
okay for the moment, but please add a FIXME.
+llvm::Constant *
+CodeGenModule::GetConstantArrayFromStringLiteral(const StringLiteral *E) {
+ assert(!E->getType()->isPointerType() && "Strings are always arrays");
+
+ unsigned CharByteWidth = StringLiteral::mapCharByteWidth(getTarget(),
+ E->getKind());
Please add a getCharByteWidth() accessor to StringLiteral, and make
mapCharByteWidth private.
StringRef getString() const {
- return StringRef(StrData, ByteLength);
+ assert((CharByteWidth==1 || CharByteWidth==2 || CharByteWidth==4)
+ && "unsupported CharByteWidth");
+ if (CharByteWidth==4) {
+ return StringRef(reinterpret_cast<const char*>(StrData.asUInt32),
+ getByteLength());
+ } else if (CharByteWidth==2) {
+ return StringRef(reinterpret_cast<const char*>(StrData.asUInt16),
+ getByteLength());
+ } else {
+ return StringRef(StrData.asChar, getByteLength());
+ }
+ }
Returning a StringRef for something that isn't using the right
representation seems like a bad idea; maybe this should just assert
that CharByteWidth==1, or something like that.
bool containsNonAsciiOrNull() const {
+ if(!isAscii() && !isUTF8()) {
+ return true;
+ }
StringRef Str = getString();
for (unsigned i = 0, e = Str.size(); i != e; ++i)
if (!isascii(Str[i]) || !Str[i])
return true;
return false;
}
What is the point of this change?
The rest of the patches seem fine.
-Eli
More information about the cfe-dev
mailing list