[clang] Treat escaped newlines as whitespace in Lexer::getRawToken. (PR #117548)

Mon Nov 25 04:54:52 PST 2024

https://github.com/bazuzi created https://github.com/llvm/llvm-project/pull/117548

The Lexer used in getRawToken is not told to keep whitespace, so when it skips over escaped newlines, it also ignores whitespace, regardless of getRawToken's IgnoreWhiteSpace parameter. My suspicion is that users that want to not IgnoreWhiteSpace and therefore return true for a whitespace character would also safely accept true for an escaped newline. For users that do use IgnoreWhiteSpace, there is no behavior change, and the handling of escaped newlines is already correct.

If an escaped newline should not be considered whitespace, then instead of this change, getRawToken should be modified to return true when whitespace follows the escaped newline present at `Loc`, perhaps by using isWhitespace(SkipEscapedNewLines(StrData)[0]). However, this is incompatible with functions like clang::tidy::utils::lexer::getPreviousTokenAndStart. getPreviousTokenAndStart loops backwards through source location offsets, always decrementing by 1 without regard for potential character sizes larger than 1, such as escaped newlines. It seems more likely to me that there are more functions like this that would break than there are users who rely on escaped newlines not being treated as whitespace by getRawToken, but I'm open to that not being true.

The modified test was printing `\\nF` for the name of the expanded macro and now does not find a macro name. In my opinion, this is not an indication that the new behavior for getRawToken is incorrect. Rather, this is, both before and after this change, due to an incorrect storage of the backslash's source location as the spelling location of the expansion location of `F`.

>From 9c8b31dc266b770927785834c841b8ae5a7ebb58 Mon Sep 17 00:00:00 2001
From: Samira Bazuzi <bazuzi at google.com>
Date: Fri, 22 Nov 2024 15:45:55 -0500
Subject: [PATCH] Treat escaped newlines as whitespace in Lexer::getRawToken.

The Lexer used in getRawToken is not told to keep whitespace, so when it skips over escaped newlines, it also ignores whitespace, regardless of getRawToken's IgnoreWhiteSpace parameter. My suspicion is that users that want to not IgnoreWhiteSpace and therefore return true for a whitespace character would also safely accept true for an escaped newline. For users that do use IgnoreWhiteSpace, there is no behavior change, and the handling of escaped newlines is already correct.

If an escaped newline should not be considered whitespace, then instead of this change, getRawToken should be modified to return true when whitespace follows the escaped newline present at `Loc`, perhaps by using isWhitespace(SkipEscapedNewLines(StrData)[0]). However, this is incompatible with functions like clang::tidy::utils::lexer::getPreviousTokenAndStart. getPreviousTokenAndStart loops backwards through source location offsets, always decrementing by 1 without regard for potential character sizes larger than 1, such as escaped newlines. It seems more likely to me that there are more functions like this that would break than there are users who rely on escaped newlines not being treated as whitespace by getRawToken, but I'm open to that not being true.

The modified test was printing `\\nF` for the name of the expanded macro and now does not find a macro name. In my opinion, this is not an indication that the new behavior for getRawToken is incorrect. Rather, this is, both before and after this change, due to an incorrect storage of the backslash's source location as the spelling location of the expansion location of `F`.
---
 clang/lib/Lex/Lexer.cpp              | 4 +++-
 clang/test/Frontend/highlight-text.c | 3 +--
 2 files changed, 4 insertions(+), 3 deletions(-)

diff --git a/clang/lib/Lex/Lexer.cpp b/clang/lib/Lex/Lexer.cpp
index e58c8bc72ae5b3..392cce6be0d171 100644
--- a/clang/lib/Lex/Lexer.cpp
+++ b/clang/lib/Lex/Lexer.cpp
@@ -527,7 +527,9 @@ bool Lexer::getRawToken(SourceLocation Loc, Token &Result,
 
   const char *StrData = Buffer.data()+LocInfo.second;
 
-  if (!IgnoreWhiteSpace && isWhitespace(StrData[0]))
+  if (!IgnoreWhiteSpace && (isWhitespace(StrData[0]) ||
+                            // Treat escaped newlines as whitespace.
+                            SkipEscapedNewLines(StrData) != StrData))
     return true;
 
   // Create a lexer starting at the beginning of this token.
diff --git a/clang/test/Frontend/highlight-text.c b/clang/test/Frontend/highlight-text.c
index a81d26caa4c24c..eefa4ebeec8ca4 100644
--- a/clang/test/Frontend/highlight-text.c
+++ b/clang/test/Frontend/highlight-text.c
@@ -12,8 +12,7 @@ int a = M;
 // CHECK-NEXT: :5:11: note: expanded from macro 'M'
 // CHECK-NEXT:     5 | #define M \
 // CHECK-NEXT:       |           ^
-// CHECK-NEXT: :3:14: note: expanded from macro '\
-// CHECK-NEXT: F'
+// CHECK-NEXT: :3:14: note: expanded from here
 // CHECK-NEXT:     3 | #define F (1 << 99)
 // CHECK-NEXT:       |              ^  ~~
 // CHECK-NEXT: :8:9: warning: shift count >= width of type [-Wshift-count-overflow]