[cfe-dev] [patch] Support for C++0x raw string literals

Wed Aug 3 10:57:01 PDT 2011

On Jul 31, 2011, at 12:24 AM, Craig Topper wrote:

> This adds test cases and fixes a few minor bugs from the first patch.
> 
> ~Craig
> 
> On Wed, Jul 27, 2011 at 11:27 PM, Craig Topper <craig.topper at gmail.com> wrote:
>> Still need to write up test cases, but so far this patch seems to
>> work. The lexer code is getting pretty ugly. I still need to fix up
>> token concatenation.
>> 
>> --
>> ~Craig
> <raw_strings.patch>

Cool. A few comments:

+def err_raw_delim_too_long : Error<
+  "raw string delimiter longer than 16 characters">;

While technically correct, this isn't a very helpful diagnostic. I'm guessing that users are going to write raw string literals and forget the XXX()XXX part of it. The first question they'll ask is, "what's a delimiter?"

Diving into the code, I think there's a better way:

+/// LexRawStringLiteral - Lex the remainder of a string literal, after having
+/// lexed either R".
+void Lexer::LexRawStringLiteral(Token &Result, const char *CurPtr,
+                                tok::TokenKind Kind) {
+  char C;
+  const char *Prefix = CurPtr;
+  unsigned PrefixLen = 0;
+
+  while (PrefixLen != 16) {
+    C = getAndAdvanceChar(CurPtr, Result);
+    switch (C) {
+      case ' ': case '(': case ')': case '\\': case '\t': case '\v':
+      case '\f': case '\n': default:
+        break;
+      /* Basic source charset except the above chars.  */
+      case 'A': case 'B': case 'C': case 'D': case 'E': case 'F': case 'G':
+      case 'H': case 'I': case 'J': case 'K': case 'L': case 'M': case 'N':
+      case 'O': case 'P': case 'Q': case 'R': case 'S': case 'T': case 'U':
+      case 'V': case 'W': case 'X': case 'Y': case 'Z':
+      case 'a': case 'b': case 'c': case 'd': case 'e': case 'f': case 'g':
+      case 'h': case 'i': case 'j': case 'k': case 'l': case 'm': case 'n':
+      case 'o': case 'p': case 'q': case 'r': case 's': case 't': case 'u':
+      case 'v': case 'w': case 'x': case 'y': case 'z':
+      case '0': case '1': case '2': case '3': case '4':
+      case '5': case '6': case '7': case '8': case '9':
+      case '_': case '{': case '}': case '[': case ']': case '#': case '<':
+      case '>': case '%': case ':': case ';': case '.': case '?': case '*':
+      case '+': case '-': case '/': case '^': case '&': case '|': case '~':
+      case '!': case '=': case ',':
+      case '"': case '\'':
+        PrefixLen++;
+        continue;
+    }
+    break;
+  }
+
+  if (PrefixLen == 16) {
+    C = getAndAdvanceChar(CurPtr, Result);
+  }

Why break out when PrefixLen==16? You could just parse an arbitrary-length d-char-sequence, and then decide what to do afterward based on what character comes next:

  - If it's a (, form the prefix and complain if it's too long.
  - If it's a ", tell the user that they forgot the delimiters. You could even suggest a delimiter in a Fix-It :)
  - Otherwise, error
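
To make the three cases concrete, here is a minimal standalone sketch of "scan first, decide afterward". It is not the patch's code: DelimKind, DelimScan, isDelimChar and scanRawDelimiter are hypothetical names, and it works on a std::string rather than the lexer's buffer. It also assumes that '"' ends the scan (the patch's switch counts it as a delimiter character), since otherwise the "forgot the delimiters" case can't be detected this way.

```cpp
#include <cstddef>
#include <string>

// Hypothetical result categories; none of these names come from the patch.
enum class DelimKind { Valid, TooLong, MissingParens, InvalidChar };

struct DelimScan {
  DelimKind Kind;
  std::string Delim;  // The characters scanned before the stop character.
};

// Assumption: printable ASCII minus '(', ')', '\\', and the characters that
// are not in the basic source character set. '"' is rejected here on purpose
// so that R"text" stops the scan at the closing quote.
inline bool isDelimChar(char C) {
  if (C == '$' || C == '@' || C == '`')
    return false;
  if (C == '(' || C == ')' || C == '\\' || C == '"')
    return false;
  return C > ' ' && C < 0x7f;  // Also rejects space, tab, newline, etc.
}

// Text is everything that follows the opening R".
inline DelimScan scanRawDelimiter(const std::string &Text) {
  // Scan a delimiter of arbitrary length; no 16-character cutoff here.
  std::size_t Len = 0;
  while (Len < Text.size() && isDelimChar(Text[Len]))
    ++Len;
  std::string Delim = Text.substr(0, Len);

  // Ran off the end of the input without finding a stop character.
  if (Len == Text.size())
    return {DelimKind::InvalidChar, Delim};

  // '(' : we have a real delimiter, so only now check its length.
  if (Text[Len] == '(')
    return {Len > 16 ? DelimKind::TooLong : DelimKind::Valid, Delim};

  // '"' : the user most likely wrote R"text" and forgot the delimiter/parens.
  if (Text[Len] == '"')
    return {DelimKind::MissingParens, Delim};

  // Anything else is not allowed in a delimiter.
  return {DelimKind::InvalidChar, Delim};
}
```

With this shape, the too-long diagnostic is only emitted once we know the user really intended a delimiter, and the MissingParens case is where a Fix-It could go.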

+    return (Ptr[0] == 'L' ||
+            (LangOpts.CPlusPlus0x &&
+             (Ptr[0] == 'u' || Ptr[0] == 'U' || Ptr[0] == 'R'))) &&
+           (Tok.getLength() == 1 ||
+            (Ptr[1] == 'R' && Ptr[0] != 'R' &&
+             Tok.getLength() == 2 && LangOpts.CPlusPlus0x) ||
+            (Ptr[0] == 'u' && Ptr[1] == '8' &&
+             (Tok.getLength() == 2 || (Ptr[2] == 'R' && Tok.getLength() == 3))));
   }

This condition is getting really hard to read. Multiple 'if' statements with comments, perhaps?

+    return (TokPtr[0] == 'L' ||
+            (LangOpts.CPlusPlus0x &&
+             (TokPtr[0] == 'u' || TokPtr[0] == 'U' || TokPtr[0] == 'R'))) &&
+           (length == 1 ||
+            (TokPtr[1] == 'R' && TokPtr[0] != 'R' &&
+             length == 2 && LangOpts.CPlusPlus0x) ||
+            (TokPtr[0] == 'u' && TokPtr[1] == '8' &&
+             (length == 2 || (TokPtr[2] == 'R' && length == 3))));

Looks the same as the above. How about factoring this out into an inline function?
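
Something along these lines would cover both this comment and the readability one above. It is only a sketch: the name isStringLiteralPrefix and its signature are made up for illustration, and the language option is passed as a plain bool instead of going through LangOpts. It is meant to accept exactly what the condition in the patch accepts (L, u, U, R, LR, uR, UR, u8, u8R).

```cpp
#include <cstddef>

// Hypothetical helper; the name and signature are illustrative only.
// Returns true if the Len characters at Ptr spell a string literal prefix.
inline bool isStringLiteralPrefix(const char *Ptr, std::size_t Len,
                                  bool CPlusPlus0x) {
  // A plain wide-string 'L' is valid in every language mode.
  if (Len == 1 && Ptr[0] == 'L')
    return true;

  // Every other prefix requires C++0x.
  if (!CPlusPlus0x)
    return false;

  // Single-character prefixes: u, U, R.
  if (Len == 1)
    return Ptr[0] == 'u' || Ptr[0] == 'U' || Ptr[0] == 'R';

  // Two-character prefixes: LR, uR, UR (raw) and u8 (UTF-8).
  if (Len == 2) {
    if (Ptr[1] == 'R')
      return Ptr[0] == 'L' || Ptr[0] == 'u' || Ptr[0] == 'U';
    return Ptr[0] == 'u' && Ptr[1] == '8';
  }

  // Three-character prefix: u8R.
  return Len == 3 && Ptr[0] == 'u' && Ptr[1] == '8' && Ptr[2] == 'R';
}
```

Both call sites could then pass (Ptr, Tok.getLength()) and (TokPtr, length) respectively, together with LangOpts.CPlusPlus0x.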

Index: lib/Lex/LiteralSupport.cpp
===================================================================

--- lib/Lex/LiteralSupport.cpp	(revision 136584)
+++ lib/Lex/LiteralSupport.cpp	(working copy)
@@ -964,6 +964,8 @@
     const char *ThisTokEnd = ThisTokBuf+ThisTokLen-1;  // Skip end quote.
     // TODO: Input character set mapping support.
 
+    bool RawString = false;
+
     // Skip L marker for wide strings.
     if (ThisTokBuf[0] == 'L' || ThisTokBuf[0] == 'u' || ThisTokBuf[0] == 'U') {
       ++ThisTokBuf;
(etc.)

Please update the comment for StringLiteralParser as well!

+    } else if (C == 0 && CurPtr-1 == BufferEnd) { // End of file.
+      if (!isLexingRawMode())
+        Diag(BufferPtr, diag::err_unterminated_raw_string);

This diagnostic should include the prefix, so that the user knows how to terminate the raw string, e.g.,

	error: raw string not terminated with ')PREFIX'
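
As a rough illustration of the message shape only (this is a standalone sketch, not Clang's diagnostic machinery, and unterminatedRawStringMessage is a hypothetical name), the prefix lexed earlier gets spliced into the text:

```cpp
#include <string>

// Hypothetical helper: builds the suggested message text from the delimiter
// that was lexed after the opening R".
inline std::string unterminatedRawStringMessage(const std::string &Prefix) {
  return "raw string not terminated with ')" + Prefix + "'";
}
```

In the real patch I'd expect this to be done by passing the prefix as an argument to the diagnostic rather than by building a string by hand.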

Things are looking very good. Thanks for working on this!

	- Doug