[PATCH] D128059: [Clang] Add a warning on invalid UTF-8 in comments.

Aaron Ballman via Phabricator via cfe-commits cfe-commits at lists.llvm.org
Wed Jun 22 12:06:26 PDT 2022

aaron.ballman added inline comments.

Comment at: clang/lib/Lex/Lexer.cpp:2717
+        __m128i Bytes =
+            _mm_loadu_si128(reinterpret_cast<const __m128i *>(CurPtr));
+        int Mask = _mm_movemask_epi8(Bytes);
cor3ntin wrote:
> This crashes when using `_mm_load_si128` which suprises me because `CurPtr` is supposedly aligned on a 16 bytes boundary here. Any idea?
Wait, did you verify that `CurPtr` really is on a 16-byte boundary, or are you thinking it should be on such a boundary? (I don't see any alignment markings on the parameter, so I'd assume it's aligned as any other pointer.)

Comment at: clang/lib/Lex/Lexer.cpp:2405-2406
     // Skip over characters in the fast loop.
-    while (C != 0 &&                // Potentially EOF.
-           C != '\n' && C != '\r')  // Newline or DOS-style newline.
+    // Warn on invalid UTF-8 if the corresponding warning is enabled, emitting a
+    // diagnostic only once per sequence that cannot be decoded.
+    while ((!WarnOnInvalidUtf8 || isASCII(C)) && C != 0 && // Potentially EOF.
tahonermann wrote:
> cor3ntin wrote:
> > aaron.ballman wrote:
> > > tahonermann wrote:
> > > > I think it would be helpful to include a link to [[ http://unicode.org/review/pr-121.html | Unicode PR issue 121 ]] here and note that the diagnostic follows policy option 1. Likewise for handling of '//' comment styles below. Alternatively, provide the link and a note in the release notes.
> > > +1 -- I hadn't realized there were other options. However, ultimately I think this should be committed when WG21 has adopted the paper and then it would be even better to have a comment citing which stable name and paragraph number the code is implementing from P2295.
> > I don't think this applies as we are not inserting replacement characters, but I'm all for adding a comment.
> > I don't know if we should reference the c++ wording though, as I don;t think there is value in not applying that behavior in all language modes.
> Perhaps phrase it something like "diagnostics are issued consistent with the replacement character insertion strategy of Unicode PR-121 policy option 1". The goal is to make it clear that the diagnostic strategy is not out of thin air; that it follows Unicode endorsed behavior.
Tom covered why I would prefer to see a comment, but as for the standards reference -- we already put in references to only one language for behavior we treat as an extension, so this would be more of the same. That ends up being helpful (especially if there's an explicit comment about it being an extension in other languages) because it tells the reader how the extension is roughly expected to behave as well.

  rG LLVM Github Monorepo



More information about the cfe-commits mailing list