[PATCH] D128059: [Clang] Add a warning on invalid UTF-8 in comments.
Corentin Jabot via Phabricator via llvm-commits
llvm-commits at lists.llvm.org
Tue Jul 12 17:29:37 PDT 2022
cor3ntin updated this revision to Diff 444118.
cor3ntin added a comment.
This turned out to be an interesting bug.
The SSE code tried to be clever and skip over valid ASCII code units while looking for invalid UTF-8.
In doing so, it could run past the end of a comment entirely if
- there was a short ASCII comment
- followed by a tiny amount of C++
- followed by another comment containing non-ASCII data.
It does not matter whether that data was valid UTF-8 or not (which was misleading,
as the file that tripped the bot is full of invalid code units).
The problematic test boils down to
enum a {
x /* 01234567890ABCDEF*/
};
/*ααααααααα*/
The fix is to do in the SSE codepath what we do in the AltiVec
and default paths: if we find an invalid code unit,
we rescan that bit of the comment on the slow path
without trying to update `CurPtr` (checking, for each code unit,
both isASCII and != '/' at the same time).
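
To make the shape of the fix concrete, here is a rough, portable sketch of the idea. It is not the code in the patch: `scanBlockComment`, `CommentScan` and `blockIsBoring` are hypothetical names, and a plain 16-byte loop stands in for the SSE compare. The point it tries to show is that the fast path only ever skips blocks that contain neither a non-ASCII byte nor a '/', so it cannot step over the closing "*/"; any other block is rescanned one byte at a time.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <string_view>

// Hypothetical result type for this sketch; not part of Clang.
struct CommentScan {
  std::size_t end;   // index one past the closing "*/", or npos if unterminated
  bool sawNonASCII;  // true if a byte >= 0x80 was seen inside the comment
};

// True if the 16 bytes at `pos` are all ASCII and none of them is '/'.
// This stands in for the vectorized check; both conditions are tested together.
inline bool blockIsBoring(std::string_view buf, std::size_t pos) {
  bool interesting = false;
  for (std::size_t i = 0; i < 16; ++i) {
    unsigned char c = static_cast<unsigned char>(buf[pos + i]);
    interesting |= (c >= 0x80) | (c == '/');
  }
  return !interesting;
}

// Scan a block comment body. `start` is the index just past the opening "/*".
inline CommentScan scanBlockComment(std::string_view buf, std::size_t start) {
  assert(start <= buf.size());
  constexpr std::size_t Block = 16;
  bool sawNonASCII = false;
  std::size_t pos = start;
  while (pos < buf.size()) {
    // Fast path: skip a whole block only when nothing in it needs attention.
    if (pos + Block <= buf.size() && blockIsBoring(buf, pos)) {
      pos += Block;
      continue;
    }
    // Slow path: rescan this stretch one byte at a time.
    std::size_t stop = std::min(pos + Block, buf.size());
    for (; pos < stop; ++pos) {
      unsigned char c = static_cast<unsigned char>(buf[pos]);
      if (c >= 0x80)
        sawNonASCII = true;  // a real lexer would validate UTF-8 here
      else if (c == '/' && pos > start && buf[pos - 1] == '*')
        return {pos + 1, sawNonASCII};
    }
  }
  return {std::string_view::npos, sawNonASCII};
}
```

Run on the reduced test above, starting just past the first "/*", this sketch stops right after the first "*/" and reports no non-ASCII data, instead of running on into the second comment.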
Repository:
rG LLVM Github Monorepo
CHANGES SINCE LAST ACTION
https://reviews.llvm.org/D128059/new/
https://reviews.llvm.org/D128059
Files:
clang/docs/ReleaseNotes.rst
clang/include/clang/Basic/DiagnosticLexKinds.td
clang/lib/Lex/Lexer.cpp
clang/test/Lexer/comment-invalid-utf8.c
clang/test/Lexer/comment-utf8.c
clang/test/SemaCXX/static-assert.cpp
llvm/include/llvm/Support/ConvertUTF.h
llvm/lib/Support/ConvertUTF.cpp
-------------- next part --------------
A non-text attachment was scrubbed...
Name: D128059.444118.patch
Type: text/x-patch
Size: 10027 bytes
Desc: not available
URL: <http://lists.llvm.org/pipermail/llvm-commits/attachments/20220713/c7c9eeaa/attachment.bin>