[PATCH] D128059: [Clang] Add a warning on invalid UTF-8 in comments.

Tom Honermann via Phabricator via cfe-commits cfe-commits at lists.llvm.org
Wed Jun 22 11:54:07 PDT 2022

tahonermann added inline comments.

Comment at: clang/lib/Lex/Lexer.cpp:2405-2406
     // Skip over characters in the fast loop.
-    while (C != 0 &&                // Potentially EOF.
-           C != '\n' && C != '\r')  // Newline or DOS-style newline.
+    // Warn on invalid UTF-8 if the corresponding warning is enabled, emitting a
+    // diagnostic only once per sequence that cannot be decoded.
+    while ((!WarnOnInvalidUtf8 || isASCII(C)) && C != 0 && // Potentially EOF.
cor3ntin wrote:
> aaron.ballman wrote:
> > tahonermann wrote:
> > > I think it would be helpful to include a link to [[ http://unicode.org/review/pr-121.html | Unicode PR issue 121 ]] here and note that the diagnostic follows policy option 1. Likewise for handling of '//' comment styles below. Alternatively, provide the link and a note in the release notes.
> > +1 -- I hadn't realized there were other options. However, ultimately I think this should be committed when WG21 has adopted the paper and then it would be even better to have a comment citing which stable name and paragraph number the code is implementing from P2295.
> I don't think this applies as we are not inserting replacement characters, but I'm all for adding a comment.
> I don't know if we should reference the c++ wording though, as I don;t think there is value in not applying that behavior in all language modes.
Perhaps phrase it something like "diagnostics are issued consistent with the replacement character insertion strategy of Unicode PR-121 policy option 1". The goal is to make it clear that the diagnostic strategy is not out of thin air; that it follows Unicode endorsed behavior.

  rG LLVM Github Monorepo



More information about the cfe-commits mailing list