[PATCH] D128059: [Clang] Add a warning on invalid UTF-8 in comments.

Corentin Jabot via Phabricator via cfe-commits cfe-commits at lists.llvm.org
Wed Jun 22 11:04:06 PDT 2022

cor3ntin added inline comments.

Comment at: clang/lib/Lex/Lexer.cpp:2717
+        __m128i Bytes =
+            _mm_loadu_si128(reinterpret_cast<const __m128i *>(CurPtr));
+        int Mask = _mm_movemask_epi8(Bytes);
This crashes when using `_mm_load_si128` which suprises me because `CurPtr` is supposedly aligned on a 16 bytes boundary here. Any idea?

Comment at: clang/lib/Lex/Lexer.cpp:2405-2406
     // Skip over characters in the fast loop.
-    while (C != 0 &&                // Potentially EOF.
-           C != '\n' && C != '\r')  // Newline or DOS-style newline.
+    // Warn on invalid UTF-8 if the corresponding warning is enabled, emitting a
+    // diagnostic only once per sequence that cannot be decoded.
+    while ((!WarnOnInvalidUtf8 || isASCII(C)) && C != 0 && // Potentially EOF.
aaron.ballman wrote:
> tahonermann wrote:
> > I think it would be helpful to include a link to [[ http://unicode.org/review/pr-121.html | Unicode PR issue 121 ]] here and note that the diagnostic follows policy option 1. Likewise for handling of '//' comment styles below. Alternatively, provide the link and a note in the release notes.
> +1 -- I hadn't realized there were other options. However, ultimately I think this should be committed when WG21 has adopted the paper and then it would be even better to have a comment citing which stable name and paragraph number the code is implementing from P2295.
I don't think this applies as we are not inserting replacement characters, but I'm all for adding a comment.
I don't know if we should reference the c++ wording though, as I don;t think there is value in not applying that behavior in all language modes.

  rG LLVM Github Monorepo



More information about the cfe-commits mailing list