[llvm] [llvm-rc] Concatenate consecutive string tokens (PR #68685)

Tue Oct 10 04:05:56 PDT 2023

llvmbot wrote:




@llvm/pr-subscribers-platform-windows

Author: Martin Storsjö (mstorsjo)

<details>
<summary>Changes</summary>

A number of constructs allow more than one string literal for what represents one single string - version info values, string tables, user data resources, etc. This mostly works like how a C compiler merges consecutive string literals, like "foo" "bar" as producing the same as "foobar". This is beneficial for producing strings with some elements provided by the preprocessor.

MS rc.exe only supports this in a few fixed locations (most of which are already supported), while GNU windres supports this essentially anywhere in any string. See
b989fcbae6f179ad887d19ceef83ace1c00b87cc for one recent change that extended support for this in one specific resource.

A reasonable use case for multiple concatenated string literals that GNU windres accepts is `1 ICON DIR "/name.ico"`, where the directory is provided via the preprocessor, expanding to another string literal; this is https://github.com/llvm/llvm-project/issues/51286.

To support this, extend the tokenizer to concatenate consecutive quoted string tokens.

The tokenizer only stores StringRefs that point into the input file - including the surrounding quotes and potential leading L to denote wide strings. In order to concatenate consecutive string tokens into one, we just fuse the StringRef ranges (including whatever whitespace there is between the tokens) and store that within the single string token.

At the end, when we are going to output something based on the string contents, we can split the single StringRef into a vector of StringRefs and then loop over them. The string tokens are currently used within ResourceFileWriter either via the processString function, outputting UTF16 data to write into the output, or the stripQuotes function, giving a StringRef to the raw contents of the string, e.g. for use for loading a file.

Intercept all these cases and pass them via functions that split the multi-string token into one or more strings for each string literal, and then process them one at a time with the existing functions.

This contains tests for the following cases:
- Tokenizer test
- Test of including an icon from a path built from multiple tokens, which is the actual intended use case.
- Test an accelerator string provided as multiple strings. (This isn't a case that users reasonbly would use, but due to the toknizer change, the ResourceFileWriter::writeSingleAccelerator could now encounter such input data.)

This change also allows simplifying the earlier added explicit support of multiple string literals in constructs such as version info and string tables, as later cleanup commits.

---
Full diff: https://github.com/llvm/llvm-project/pull/68685.diff


8 Files Affected:

- (added) llvm/test/tools/llvm-rc/Inputs/accelerators-split-token.rc (+5) 
- (added) llvm/test/tools/llvm-rc/Inputs/split-path.rc (+1) 
- (modified) llvm/test/tools/llvm-rc/Inputs/tokens.rc (+3) 
- (added) llvm/test/tools/llvm-rc/accelerators-split-token.test (+13) 
- (added) llvm/test/tools/llvm-rc/split-path.test (+6) 
- (modified) llvm/test/tools/llvm-rc/tokenizer.test (+2) 
- (modified) llvm/tools/llvm-rc/ResourceFileWriter.cpp (+108-28) 
- (modified) llvm/tools/llvm-rc/ResourceScriptToken.cpp (+10) 


``````````diff

diff --git a/llvm/test/tools/llvm-rc/Inputs/accelerators-split-token.rc b/llvm/test/tools/llvm-rc/Inputs/accelerators-split-token.rc
new file mode 100644
index 000000000000000..24377e014caf5e3
--- /dev/null
+++ b/llvm/test/tools/llvm-rc/Inputs/accelerators-split-token.rc
@@ -0,0 +1,5 @@
+1 ACCELERATORS
+LANGUAGE 5, 1
+{
+  "^" "a", 18
+}
diff --git a/llvm/test/tools/llvm-rc/Inputs/split-path.rc b/llvm/test/tools/llvm-rc/Inputs/split-path.rc
new file mode 100644
index 000000000000000..4f60b6f78ae15c5
--- /dev/null
+++ b/llvm/test/tools/llvm-rc/Inputs/split-path.rc
@@ -0,0 +1 @@
+100 ICON "subdir" "/icon-new.ico"
diff --git a/llvm/test/tools/llvm-rc/Inputs/tokens.rc b/llvm/test/tools/llvm-rc/Inputs/tokens.rc
index 6a781202a7e3721..7e1c6e923d25e14 100644
--- a/llvm/test/tools/llvm-rc/Inputs/tokens.rc
+++ b/llvm/test/tools/llvm-rc/Inputs/tokens.rc
@@ -4,6 +4,9 @@ identifier-with-dashes
 
 "RC string test.",L"Another RC string test.'&{",42,100
 
+"concatenated"  L"string"
+"token"
+
 Block Comment Ident /*block /* // comment */ ifier
 
 Line Comment // Identifier /*
diff --git a/llvm/test/tools/llvm-rc/accelerators-split-token.test b/llvm/test/tools/llvm-rc/accelerators-split-token.test
new file mode 100644
index 000000000000000..9414c787e0f24a2
--- /dev/null
+++ b/llvm/test/tools/llvm-rc/accelerators-split-token.test
@@ -0,0 +1,13 @@
+; RUN: llvm-rc -no-preprocess /FO %t -- %p/Inputs/accelerators-split-token.rc
+; RUN: llvm-readobj %t | FileCheck %s
+
+; CHECK: Resource type (int): ACCELERATOR (ID 9)
+; CHECK-NEXT: Resource name (int): 1
+; CHECK-NEXT: Data version: 0
+; CHECK-NEXT: Memory flags: 0x30
+; CHECK-NEXT: Language ID: 1029
+; CHECK-NEXT: Version (major): 0
+; CHECK-NEXT: Version (minor): 0
+; CHECK-NEXT: Characteristics: 0
+; CHECK-NEXT: Data size: 8
+; CHECK-NEXT: Data:: (80 00 01 00 12 00 00 00)
diff --git a/llvm/test/tools/llvm-rc/split-path.test b/llvm/test/tools/llvm-rc/split-path.test
new file mode 100644
index 000000000000000..adfcfa2af0add57
--- /dev/null
+++ b/llvm/test/tools/llvm-rc/split-path.test
@@ -0,0 +1,6 @@
+; RUN: rm -rf %t
+; RUN: mkdir %t
+; RUN: cd %t
+; RUN: mkdir subdir
+; RUN: cp %p/Inputs/icon-new.ico subdir
+; RUN: llvm-rc -no-preprocess /FO %t/split-path.res -- %p/Inputs/split-path.rc
diff --git a/llvm/test/tools/llvm-rc/tokenizer.test b/llvm/test/tools/llvm-rc/tokenizer.test
index 8486f8bd786909e..4d327e4084b27e8 100644
--- a/llvm/test/tools/llvm-rc/tokenizer.test
+++ b/llvm/test/tools/llvm-rc/tokenizer.test
@@ -35,6 +35,8 @@
 ; CHECK-NEXT:  Int: 42; int value = 42
 ; CHECK-NEXT:  Comma: ,
 ; CHECK-NEXT:  Int: 100; int value = 100
+; CHECK-NEXT:  String: "concatenated"  L"string"
+; CHECK-NEXT:  "token"
 ; CHECK-NEXT:  Identifier: Block
 ; CHECK-NEXT:  Identifier: Comment
 ; CHECK-NEXT:  Identifier: Ident
diff --git a/llvm/tools/llvm-rc/ResourceFileWriter.cpp b/llvm/tools/llvm-rc/ResourceFileWriter.cpp
index dd9db338ece255b..31d42489841f694 100644
--- a/llvm/tools/llvm-rc/ResourceFileWriter.cpp
+++ b/llvm/tools/llvm-rc/ResourceFileWriter.cpp
@@ -109,6 +109,73 @@ static bool stripQuotes(StringRef &Str, bool &IsLongString) {
   return true;
 }
 
+static void splitStrings(StringRef Str, SmallVectorImpl<StringRef> &Result) {
+  if (!Str.starts_with_insensitive("L\"") &&
+      !Str.starts_with_insensitive("\"")) {
+    Result.push_back(Str);
+    return;
+  }
+  while (!Str.empty()) {
+    size_t Pos = 0;
+    if (Str.starts_with_insensitive("L"))
+      ++Pos;
+    assert(Pos < Str.size() && Str[Pos] == '"' &&
+           "Strings should be enclosed in quotes.");
+    ++Pos;
+    while (Pos < Str.size()) {
+      if (Str[Pos] == '"') {
+        // Found a quote
+        if (Pos + 1 == Str.size()) {
+          // At the end of the string; enqueue the rest
+          Result.push_back(Str);
+          return;
+        }
+
+        if (Str[Pos + 1] == '"') {
+          // Double quote, skip past it
+          Pos += 2;
+          continue;
+        }
+
+        // Single quote in the middle of a string, terminating the current one.
+        Result.push_back(Str.take_front(Pos + 1));
+        Str = Str.drop_front(Pos + 1);
+        Str = Str.ltrim(); // Skip past whitespace
+        Pos = 0;
+        // Restart the outer loop, starting a new string
+        break;
+      }
+
+      // Not a quote, advance
+      ++Pos;
+    }
+    assert(Pos == 0 && "Should have consumed a string");
+  }
+  llvm_unreachable("Misquoted strings");
+}
+
+static bool extractQuotedStrings(StringRef &Str, SmallString<0> &Buf) {
+  // Extract a StringRef from a parser token that contains one or more
+  // quoted strings.
+  SmallVector<StringRef, 3> SubStrs;
+  splitStrings(Str, SubStrs);
+  bool IsLongString;
+  if (SubStrs.size() == 1) {
+    assert(Str == SubStrs[0]);
+    return stripQuotes(Str, IsLongString);
+  } else {
+    // If the string token consisted of multiple quoted strings, run stripQuotes
+    // on each of them and concatenate the result in the user provided buffer.
+    for (StringRef S : SubStrs) {
+      if (!stripQuotes(S, IsLongString))
+        return false;
+      Buf.append(S);
+    }
+    Str = Buf.str();
+    return true;
+  }
+}
+
 static UTF16 cp1252ToUnicode(unsigned char C) {
   static const UTF16 Map80[] = {
       0x20ac, 0x0081, 0x201a, 0x0192, 0x201e, 0x2026, 0x2020, 0x2021,
@@ -360,6 +427,18 @@ static Error processString(StringRef Str, NullHandlingMethod NullHandler,
   return Error::success();
 }
 
+static Error processStrings(StringRef Str, NullHandlingMethod NullHandler,
+                            SmallVectorImpl<UTF16> &Result, int CodePage) {
+  SmallVector<StringRef, 3> SubStrs;
+  splitStrings(Str, SubStrs);
+  for (const auto &S : SubStrs) {
+    bool IsLongString;
+    RETURN_IF_ERROR(
+        processString(S, NullHandler, IsLongString, Result, CodePage));
+  }
+  return Error::success();
+}
+
 uint64_t ResourceFileWriter::writeObject(const ArrayRef<uint8_t> Data) {
   uint64_t Result = tell();
   FS->write((const char *)Data.begin(), Data.size());
@@ -368,10 +447,8 @@ uint64_t ResourceFileWriter::writeObject(const ArrayRef<uint8_t> Data) {
 
 Error ResourceFileWriter::writeCString(StringRef Str, bool WriteTerminator) {
   SmallVector<UTF16, 128> ProcessedString;
-  bool IsLongString;
-  RETURN_IF_ERROR(processString(Str, NullHandlingMethod::CutAtNull,
-                                IsLongString, ProcessedString,
-                                Params.CodePage));
+  RETURN_IF_ERROR(processStrings(Str, NullHandlingMethod::CutAtNull,
+                                 ProcessedString, Params.CodePage));
   for (auto Ch : ProcessedString)
     writeInt<uint16_t>(Ch);
   if (WriteTerminator)
@@ -400,8 +477,8 @@ void ResourceFileWriter::writeRCInt(RCInt Value) {
 }
 
 Error ResourceFileWriter::appendFile(StringRef Filename) {
-  bool IsLong;
-  stripQuotes(Filename, IsLong);
+  SmallString<0> Buf;
+  extractQuotedStrings(Filename, Buf);
 
   auto File = loadFile(Filename);
   if (!File)
@@ -645,8 +722,8 @@ Error ResourceFileWriter::writeSingleAccelerator(
   }
 
   StringRef Str = Obj.Event.getString();
-  bool IsWide;
-  stripQuotes(Str, IsWide);
+  SmallString<0> Buf;
+  extractQuotedStrings(Str, Buf);
 
   if (Str.size() == 0 || Str.size() > 2)
     return createAccError(
@@ -706,8 +783,8 @@ Error ResourceFileWriter::writeAcceleratorsBody(const RCResource *Base) {
 
 Error ResourceFileWriter::writeBitmapBody(const RCResource *Base) {
   StringRef Filename = cast<BitmapResource>(Base)->BitmapLoc;
-  bool IsLong;
-  stripQuotes(Filename, IsLong);
+  SmallString<0> Buf;
+  extractQuotedStrings(Filename, Buf);
 
   auto File = loadFile(Filename);
   if (!File)
@@ -884,8 +961,8 @@ Error ResourceFileWriter::visitIconOrCursorResource(const RCResource *Base) {
     Type = IconCursorGroupType::Cursor;
   }
 
-  bool IsLong;
-  stripQuotes(FileStr, IsLong);
+  SmallString<0> Buf;
+  extractQuotedStrings(FileStr, Buf);
   auto File = loadFile(FileStr);
 
   if (!File)
@@ -1323,10 +1400,9 @@ Error ResourceFileWriter::writeStringTableBundleBody(const RCResource *Base) {
     // (which is not null-terminated).
     SmallVector<UTF16, 128> Data;
     if (Res->Bundle.Data[ID]) {
-      bool IsLongString;
       for (StringRef S : *Res->Bundle.Data[ID])
-        RETURN_IF_ERROR(processString(S, NullHandlingMethod::CutAtDoubleNull,
-                                      IsLongString, Data, Params.CodePage));
+        RETURN_IF_ERROR(processStrings(S, NullHandlingMethod::CutAtDoubleNull,
+                                       Data, Params.CodePage));
       if (AppendNull)
         Data.push_back('\0');
     }
@@ -1372,21 +1448,25 @@ Error ResourceFileWriter::writeUserDefinedBody(const RCResource *Base) {
       continue;
     }
 
-    SmallVector<UTF16, 128> ProcessedString;
-    bool IsLongString;
-    RETURN_IF_ERROR(
-        processString(Elem.getString(), NullHandlingMethod::UserResource,
-                      IsLongString, ProcessedString, Params.CodePage));
+    SmallVector<StringRef, 3> SubStrs;
+    splitStrings(Elem.getString(), SubStrs);
+    for (const auto &S : SubStrs) {
+      SmallVector<UTF16, 128> ProcessedString;
+      bool IsLongString;
+      RETURN_IF_ERROR(processString(S, NullHandlingMethod::UserResource,
+                                    IsLongString, ProcessedString,
+                                    Params.CodePage));
+
+      for (auto Ch : ProcessedString) {
+        if (IsLongString) {
+          writeInt(Ch);
+          continue;
+        }
 
-    for (auto Ch : ProcessedString) {
-      if (IsLongString) {
-        writeInt(Ch);
-        continue;
+        RETURN_IF_ERROR(checkNumberFits<uint8_t>(
+            Ch, "Character in narrow string in user-defined resource"));
+        writeInt<uint8_t>(Ch);
       }
-
-      RETURN_IF_ERROR(checkNumberFits<uint8_t>(
-          Ch, "Character in narrow string in user-defined resource"));
-      writeInt<uint8_t>(Ch);
     }
   }
 
diff --git a/llvm/tools/llvm-rc/ResourceScriptToken.cpp b/llvm/tools/llvm-rc/ResourceScriptToken.cpp
index a8f40abdf8a680c..4422dbc75f8d378 100644
--- a/llvm/tools/llvm-rc/ResourceScriptToken.cpp
+++ b/llvm/tools/llvm-rc/ResourceScriptToken.cpp
@@ -188,6 +188,16 @@ Expected<std::vector<RCToken>> Tokenizer::run() {
         return getStringError("Integer invalid or too large: " +
                               Token.value().str());
       }
+    } else if (TokenKind == Kind::String && !Result.empty()) {
+      RCToken &Prev = Result.back();
+      if (Prev.kind() == Kind::String) {
+        StringRef PrevStr = Prev.value();
+        StringRef Concat(PrevStr.data(), Token.value().data() +
+                                             Token.value().size() -
+                                             PrevStr.data());
+        Result.pop_back();
+        Token = RCToken(TokenKind, Concat);
+      }
     }
 
     Result.push_back(Token);

``````````

</details>


https://github.com/llvm/llvm-project/pull/68685