[PATCH] D128786: [clang-format] Fix incorrect isspace input (NFC)
Kevin Cadieux via Phabricator via cfe-commits
cfe-commits at lists.llvm.org
Tue Jun 28 23:42:57 PDT 2022
kevcadieux created this revision.
Herald added a project: All.
kevcadieux requested review of this revision.
Herald added a project: clang.
Herald added a subscriber: cfe-commits.
This change fixes a clang-format unit test failure introduced by D124748 <https://reviews.llvm.org/D124748>. The `countLeadingWhitespace` function was calling `isspace` with values that could fall outside the valid input range. `isspace` accepts only values representable as `unsigned char` (0-255) or `EOF`; any other value produces undefined behavior, which on Windows manifests as an assertion being raised in the debug runtime libraries. `countLeadingWhitespace` was passing a plain `char`, which is signed there and becomes negative when the underlying byte's value is 128 or above, as happens with non-ASCII input such as UTF-8 text. The fix is to use `StringRef`'s `bytes_begin` and `bytes_end` iterators to read the values as unsigned chars instead.
This bug can be reproduced by building the `check-clang-unit` target with a DEBUG configuration under Windows. This change is already covered by existing unit tests.
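The signedness issue described above can be illustrated in isolation. The following is a minimal sketch, not the clang-format code: `countLeadingSpaceBytes` and `checkCountLeadingSpaceBytes` are hypothetical names, and `std::string_view` with a cast stands in for `StringRef`'s `bytes_begin`/`bytes_end`. It assumes the default "C" locale.

```cpp
#include <cassert>
#include <cctype>
#include <cstddef>
#include <string_view>

// Hypothetical helper (not part of clang-format): counts leading whitespace
// bytes, converting each byte to unsigned char before calling std::isspace.
inline std::size_t countLeadingSpaceBytes(std::string_view Text) {
  std::size_t N = 0;
  while (N < Text.size()) {
    // Where plain char is signed, a byte >= 0x80 reads as a negative value.
    // Converting to unsigned char keeps the argument inside the range that
    // std::isspace accepts.
    const unsigned char Byte = static_cast<unsigned char>(Text[N]);
    if (!std::isspace(Byte))
      break;
    ++N;
  }
  return N;
}

// Hypothetical self-check exercising ASCII and non-ASCII input.
inline void checkCountLeadingSpaceBytes() {
  assert(countLeadingSpaceBytes("  \tfoo") == 3);
  // "\xC3\xA9" is the UTF-8 encoding of U+00E9; both bytes are >= 0x80.
  assert(countLeadingSpaceBytes(" \xC3\xA9") == 1);
  assert(countLeadingSpaceBytes("\xC3\xA9 ") == 0);
}
```

Passing `Text[N]` directly to `std::isspace` would compile, but for the two UTF-8 bytes above it would hand the function a negative argument on platforms with a signed `char`, which is the case the patch avoids.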
Repository:
rG LLVM Github Monorepo
https://reviews.llvm.org/D128786
Files:
clang/lib/Format/FormatTokenLexer.cpp
Index: clang/lib/Format/FormatTokenLexer.cpp
===================================================================
--- clang/lib/Format/FormatTokenLexer.cpp
+++ clang/lib/Format/FormatTokenLexer.cpp
@@ -864,8 +864,10 @@
   // Directly using the regex turned out to be slow. With the regex
   // version formatting all files in this directory took about 1.25
   // seconds. This version took about 0.5 seconds.
-  const char *Cur = Text.begin();
-  while (Cur < Text.end()) {
+  const unsigned char *const Begin = Text.bytes_begin();
+  const unsigned char *const End = Text.bytes_end();
+  const unsigned char *Cur = Begin;
+  while (Cur < End) {
     if (isspace(Cur[0])) {
       ++Cur;
     } else if (Cur[0] == '\\' && (Cur[1] == '\n' || Cur[1] == '\r')) {
@@ -874,20 +876,20 @@
       // The source has a null byte at the end. So the end of the entire input
       // isn't reached yet. Also the lexer doesn't break apart an escaped
       // newline.
-      assert(Text.end() - Cur >= 2);
+      assert(End - Cur >= 2);
       Cur += 2;
     } else if (Cur[0] == '?' && Cur[1] == '?' && Cur[2] == '/' &&
                (Cur[3] == '\n' || Cur[3] == '\r')) {
       // Newlines can also be escaped by a '?' '?' '/' trigraph. By the way, the
       // characters are quoted individually in this comment because if we write
       // them together some compilers warn that we have a trigraph in the code.
-      assert(Text.end() - Cur >= 4);
+      assert(End - Cur >= 4);
       Cur += 4;
     } else {
       break;
     }
   }
-  return Cur - Text.begin();
+  return Cur - Begin;
 }

 FormatToken *FormatTokenLexer::getNextToken() {