[PATCH] D128059: [Clang] Add a warning on invalid UTF-8 in comments.

Corentin Jabot via Phabricator via cfe-commits cfe-commits at lists.llvm.org
Tue Jul 12 17:29:36 PDT 2022


cor3ntin updated this revision to Diff 444118.
cor3ntin added a comment.

This turned out to be an interesting bug.
The SSE code tried to be clever and skip over valid ASCII code units when looking for invalid UTF-8.
In doing so, it could run past the end of a comment entirely if

- there was a short ASCII comment
- followed by a tiny amount of C++
- followed by another comment containing non-ASCII data.

It does not matter whether that data was valid UTF-8 or not (which was misleading,
as the file that tripped the bot is full of invalid code units).

The problematic test boils down to

  enum a {
      x  /* 01234567890ABCDEF*/
  };
  /*ααααααααα*/
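
To illustrate the failure mode on this input, here is a minimal sketch. It is not the actual Lexer.cpp code: `skipASCIIOnly` and `findCommentEnd` are hypothetical helpers invented for this example. A scan that only looks for the next non-ASCII code unit, and ignores `/`, starts inside the first comment and stops inside the second one.

```cpp
#include <cstddef>
#include <string_view>

// Hypothetical helper (not Clang's API): advance past ASCII code units,
// looking only for the next non-ASCII byte and ignoring '/' entirely.
inline std::size_t skipASCIIOnly(std::string_view Buf, std::size_t Pos) {
  while (Pos < Buf.size() && static_cast<unsigned char>(Buf[Pos]) < 0x80)
    ++Pos;
  return Pos;
}

// Hypothetical helper: index just past the "*/" that closes the comment
// whose body starts at Pos, or Buf.size() if the comment is unterminated.
inline std::size_t findCommentEnd(std::string_view Buf, std::size_t Pos) {
  std::size_t End = Buf.find("*/", Pos);
  return End == std::string_view::npos ? Buf.size() : End + 2;
}
```

With the reproducer above, `skipASCIIOnly` called from just after the first `/*` returns an offset larger than the one returned by `findCommentEnd`, which is the overrun described in this comment.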

The fix is to do in the SSE codepath what we do in the AltiVec
and default paths: if we find an invalid code unit,
we rescan that bit of the comment on the slow path
without trying to update `CurPtr` (and, for each code unit,
checking both isASCII and != '/' at the same time).
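
A rough, portable sketch of that shape follows. It assumes a plain 16-byte chunk loop in place of the SSE intrinsics, and `findSlashOrNonASCII` and `isASCIIByte` are hypothetical names, not the functions in the patch. The fast path only skips chunks that contain neither a `/` nor a non-ASCII byte; anything else is rescanned one code unit at a time with both checks applied together.

```cpp
#include <cstddef>
#include <string_view>

// Hypothetical helper (not Clang's API): true for code units below 0x80.
inline bool isASCIIByte(unsigned char C) { return C < 0x80; }

// Returns the index of the first byte at or after Pos that is either '/'
// or non-ASCII, or Buf.size() if there is none.
inline std::size_t findSlashOrNonASCII(std::string_view Buf, std::size_t Pos) {
  constexpr std::size_t ChunkSize = 16;
  if (Pos > Buf.size())
    return Buf.size();

  // Fast path: skip whole chunks that hold only ASCII and no '/'.
  // A real SSE version would test the 16 bytes with vector compares.
  while (Buf.size() - Pos >= ChunkSize) {
    bool Interesting = false;
    for (std::size_t I = 0; I != ChunkSize; ++I) {
      unsigned char C = static_cast<unsigned char>(Buf[Pos + I]);
      Interesting |= (C == '/') || !isASCIIByte(C);
    }
    if (Interesting)
      break; // Do not guess a position; rescan this chunk below.
    Pos += ChunkSize;
  }

  // Slow path: check every code unit for both conditions at once, so a
  // '/' that may close the comment is never skipped.
  while (Pos < Buf.size()) {
    unsigned char C = static_cast<unsigned char>(Buf[Pos]);
    if (C == '/' || !isASCIIByte(C))
      break;
    ++Pos;
  }
  return Pos;
}
```

Called on the reproducer with `Pos` just after the first `/*`, this stops at the `/` of the closing `*/` and never reaches the second comment.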


Repository:
  rG LLVM Github Monorepo

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D128059/new/

https://reviews.llvm.org/D128059

Files:
  clang/docs/ReleaseNotes.rst
  clang/include/clang/Basic/DiagnosticLexKinds.td
  clang/lib/Lex/Lexer.cpp
  clang/test/Lexer/comment-invalid-utf8.c
  clang/test/Lexer/comment-utf8.c
  clang/test/SemaCXX/static-assert.cpp
  llvm/include/llvm/Support/ConvertUTF.h
  llvm/lib/Support/ConvertUTF.cpp

-------------- next part --------------
A non-text attachment was scrubbed...
Name: D128059.444118.patch
Type: text/x-patch
Size: 10027 bytes
Desc: not available
URL: <http://lists.llvm.org/pipermail/cfe-commits/attachments/20220713/c7c9eeaa/attachment.bin>

