[cfe-dev] [patch] Support for C++0x raw string literals
Douglas Gregor
dgregor at apple.com
Wed Aug 3 10:57:01 PDT 2011
On Jul 31, 2011, at 12:24 AM, Craig Topper wrote:
> This adds test cases and fixes a few minor bugs from the first patch.
>
> ~Craig
>
> On Wed, Jul 27, 2011 at 11:27 PM, Craig Topper <craig.topper at gmail.com> wrote:
>> Still need to write up test cases, but so far this patch seems to
>> work. The lexer code is getting pretty ugly. I still need to fix up
>> token concatenation.
>>
>> --
>> ~Craig
> <raw_strings.patch>
Cool. A few comments:
+def err_raw_delim_too_long : Error<
+ "raw string delimiter longer than 16 characters">;
While technically correct, this isn't a very helpful diagnostic. I'm guessing that users are going to write raw string literals and forget the XXX()XXX part of it. The first question they'll ask is, "what's a delimiter?"
Diving into the code, I think there's a better way:
+/// LexRawStringLiteral - Lex the remainder of a string literal, after having
+/// lexed either R".
+void Lexer::LexRawStringLiteral(Token &Result, const char *CurPtr,
+ tok::TokenKind Kind) {
+ char C;
+ const char *Prefix = CurPtr;
+ unsigned PrefixLen = 0;
+
+ while (PrefixLen != 16) {
+ C = getAndAdvanceChar(CurPtr, Result);
+ switch (C) {
+ case ' ': case '(': case ')': case '\\': case '\t': case '\v':
+ case '\f': case '\n': default:
+ break;
+ /* Basic source charset except the above chars. */
+ case 'A': case 'B': case 'C': case 'D': case 'E': case 'F': case 'G':
+ case 'H': case 'I': case 'J': case 'K': case 'L': case 'M': case 'N':
+ case 'O': case 'P': case 'Q': case 'R': case 'S': case 'T': case 'U':
+ case 'V': case 'W': case 'X': case 'Y': case 'Z':
+ case 'a': case 'b': case 'c': case 'd': case 'e': case 'f': case 'g':
+ case 'h': case 'i': case 'j': case 'k': case 'l': case 'm': case 'n':
+ case 'o': case 'p': case 'q': case 'r': case 's': case 't': case 'u':
+ case 'v': case 'w': case 'x': case 'y': case 'z':
+ case '0': case '1': case '2': case '3': case '4':
+ case '5': case '6': case '7': case '8': case '9':
+ case '_': case '{': case '}': case '[': case ']': case '#': case '<':
+ case '>': case '%': case ':': case ';': case '.': case '?': case '*':
+ case '+': case '-': case '/': case '^': case '&': case '|': case '~':
+ case '!': case '=': case ',':
+ case '"': case '\'':
+ PrefixLen++;
+ continue;
+ }
+ break;
+ }
+
+ if (PrefixLen == 16) {
+ C = getAndAdvanceChar(CurPtr, Result);
+ }
Why break out when PrefixLen==16? You could just parse an arbitrary-length d-char-sequence, and then decide what to do afterward based on what character comes next:
- If it's a (, form the prefix and complain if it's too long.
- If it's a ", tell the user that they forgot the delimiters. You could even suggest a delimiter in a Fix-It :)
- Otherwise, emit an error.
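A rough sketch of that approach, written as a standalone function rather than real Lexer code. The names DelimCheck, isDelimChar and classifyRawDelimiter are made up for illustration, the input is assumed to be a NUL-terminated buffer pointing just past the opening R", and the sketch assumes '"' stops the scan so that the "forgot the delimiters" case can be detected (the patch above counts '"' as a delimiter character).

```cpp
#include <cstddef>
#include <cstring>

// Hypothetical outcome of scanning the delimiter; not part of the patch.
enum class DelimCheck { Ok, TooLong, MissingParens, Invalid };

// Characters accepted in the delimiter for this sketch. Assumption: '"' is
// left out so that R"foo" can be reported as missing its parentheses.
static bool isDelimChar(char C) {
  if ((C >= 'A' && C <= 'Z') || (C >= 'a' && C <= 'z') ||
      (C >= '0' && C <= '9'))
    return true;
  // strchr would also match the terminating NUL, so exclude it explicitly.
  return C != '\0' && std::strchr("_{}[]#<>%:;.?*+-/^&|~!=,'", C) != nullptr;
}

// Ptr points just past the opening R". Len receives the delimiter length.
static DelimCheck classifyRawDelimiter(const char *Ptr, std::size_t &Len) {
  // Scan the whole delimiter first, however long it is.
  Len = 0;
  while (isDelimChar(Ptr[Len]))
    ++Len;

  // Then decide what to do based on the character that stopped the scan.
  switch (Ptr[Len]) {
  case '(': // Well-formed so far; only the length can be wrong.
    return Len > 16 ? DelimCheck::TooLong : DelimCheck::Ok;
  case '"': // Looks like R"foo": the user forgot the ( ) part.
    return DelimCheck::MissingParens;
  default:  // Space, backslash, end of buffer, etc.
    return DelimCheck::Invalid;
  }
}
```

With this shape the length check happens in exactly one place, and each outcome can map to its own diagnostic.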
+ return (Ptr[0] == 'L' ||
+ (LangOpts.CPlusPlus0x &&
+ (Ptr[0] == 'u' || Ptr[0] == 'U' || Ptr[0] == 'R'))) &&
+ (Tok.getLength() == 1 ||
+ (Ptr[1] == 'R' && Ptr[0] != 'R' &&
+ Tok.getLength() == 2 && LangOpts.CPlusPlus0x) ||
+ (Ptr[0] == 'u' && Ptr[1] == '8' &&
+ (Tok.getLength() == 2 || (Ptr[2] == 'R' && Tok.getLength() == 3))));
}
This condition is getting really hard to read. Multiple 'if' statements with comments, perhaps?
+ return (TokPtr[0] == 'L' ||
+ (LangOpts.CPlusPlus0x &&
+ (TokPtr[0] == 'u' || TokPtr[0] == 'U' || TokPtr[0] == 'R'))) &&
+ (length == 1 ||
+ (TokPtr[1] == 'R' && TokPtr[0] != 'R' &&
+ length == 2 && LangOpts.CPlusPlus0x) ||
+ (TokPtr[0] == 'u' && TokPtr[1] == '8' &&
+ (length == 2 || (TokPtr[2] == 'R' && length == 3))));
This looks the same as the condition above. How about factoring it out into an inline function?
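Something along these lines might work for both places. This is only a sketch: the name isStringLiteralPrefix is hypothetical, and it takes a plain bool instead of LangOptions so that it stands alone. It is meant to accept the same prefixes as the condition above (L, u, U, R, LR, uR, UR, u8, u8R).

```cpp
// Hypothetical helper; Ptr/Len describe the identifier-like token that
// precedes the string, CPlusPlus0x stands in for LangOpts.CPlusPlus0x.
static inline bool isStringLiteralPrefix(const char *Ptr, unsigned Len,
                                         bool CPlusPlus0x) {
  // No valid prefix is empty or longer than "u8R".
  if (Len == 0 || Len > 3)
    return false;

  // L"..." is a wide string in every language mode.
  if (Len == 1 && Ptr[0] == 'L')
    return true;

  // Every other prefix is C++0x only.
  if (!CPlusPlus0x)
    return false;

  // Single-character prefixes: u"...", U"...", R"...".
  if (Len == 1)
    return Ptr[0] == 'u' || Ptr[0] == 'U' || Ptr[0] == 'R';

  // UTF-8 strings: u8"..." and the raw form u8R"...".
  if (Ptr[0] == 'u' && Ptr[1] == '8')
    return Len == 2 || (Len == 3 && Ptr[2] == 'R');

  // Raw forms of the one-character prefixes: LR, uR, UR.
  return Len == 2 && Ptr[1] == 'R' &&
         (Ptr[0] == 'L' || Ptr[0] == 'u' || Ptr[0] == 'U');
}
```

Both call sites would then reduce to a single call, e.g. isStringLiteralPrefix(Ptr, Tok.getLength(), LangOpts.CPlusPlus0x).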
Index: lib/Lex/LiteralSupport.cpp
===================================================================
--- lib/Lex/LiteralSupport.cpp (revision 136584)
+++ lib/Lex/LiteralSupport.cpp (working copy)
@@ -964,6 +964,8 @@
const char *ThisTokEnd = ThisTokBuf+ThisTokLen-1; // Skip end quote.
// TODO: Input character set mapping support.
+ bool RawString = false;
+
// Skip L marker for wide strings.
if (ThisTokBuf[0] == 'L' || ThisTokBuf[0] == 'u' || ThisTokBuf[0] == 'U') {
++ThisTokBuf;
(etc.)
Please update the comment for StringLiteralParser as well!
+ } else if (C == 0 && CurPtr-1 == BufferEnd) { // End of file.
+ if (!isLexingRawMode())
+ Diag(BufferPtr, diag::err_unterminated_raw_string);
This diagnostic here should include the prefix, so that the user knows how to terminate the raw string, e.g.,
error: raw string not terminated with ')PREFIX'
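To illustrate the message shape only: a tiny standalone sketch that builds the text from the delimiter the lexer has already scanned. The function name is made up, and in the real patch this would presumably go through the normal diagnostic machinery with the prefix passed as an argument rather than by string concatenation.

```cpp
#include <string>

// Hypothetical helper: Prefix is the delimiter scanned after the opening R".
static std::string unterminatedRawStringMessage(const std::string &Prefix) {
  // Show the user the exact sequence that closes this raw string.
  return "raw string not terminated with ')" + Prefix + "'";
}
```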
Things are looking very good. Thanks for working on this!
- Doug