[PATCH] D128059: [Clang] Add a warning on invalid UTF-8 in comments.

Aaron Ballman via Phabricator via llvm-commits llvm-commits at lists.llvm.org
Tue Jun 21 06:57:57 PDT 2022


aaron.ballman added inline comments.


================
Comment at: clang/include/clang/Basic/DiagnosticLexKinds.td:116-117
   "source file is not valid UTF-8">;
+def warn_invalid_utf8_in_comment : Warning<
+  "invalid UTF-8 in comment">, InGroup<DiagGroup<"invalid-utf8">>, DefaultIgnore;
 def err_character_not_allowed : Error<
----------------
What would it take to enable the diagnostic by default? We don't typically add off-by-default diagnostics because there's evidence that not many folks enable them.

Alternatively, would this make sense as a pedantic warning? It's not really an extension for us to accept this stuff, but it seems like we can still pedantically diagnose the code?


================
Comment at: clang/lib/Lex/Lexer.cpp:2399
+                                      getSourceLocation(CurPtr));
+  bool UnicodeDecodeFailed = false;
+
----------------
It looks like this can move into the `while` loop and we can remove some `= false` from within the loop?


================
Comment at: clang/lib/Lex/Lexer.cpp:2407
+    // diagnostic only once per sequence that cannot be decoded.
+    while ((!WarnOnInvalidUtf8 || isASCII(C)) && C != 0 && // Potentially EOF.
+           C != '\n' && C != '\r') { // Newline or DOS-style newline.
----------------
Do you have some idea of performance cost for enabling the warning with a large TU? Are we talking .5%? 1%? 5%? noise?


================
Comment at: clang/lib/Lex/Lexer.cpp:2417
+      if (Length == 0) {
+        if (!UnicodeDecodeFailed) {
+          Diag(CurPtr, diag::warn_invalid_utf8_in_comment);
----------------
aaron.ballman wrote:
> 
So if unicode decoding didn't fail... diagnose? :-) The naming here seems unfortunate.


================
Comment at: clang/lib/Lex/Lexer.cpp:2417-2419
+        if (!UnicodeDecodeFailed) {
+          Diag(CurPtr, diag::warn_invalid_utf8_in_comment);
+        }
----------------



================
Comment at: clang/lib/Lex/Lexer.cpp:2698
+                                      getSourceLocation(CurPtr));
+  bool UnicodeDecodeFailed = false;
+
----------------
I think this can move into the loop as well.


================
Comment at: clang/lib/Lex/Lexer.cpp:2764-2766
+        if (!UnicodeDecodeFailed) {
+          Diag(CurPtr, diag::warn_invalid_utf8_in_comment);
+        }
----------------
Same concerns about naming here as above.


================
Comment at: llvm/lib/Support/ConvertUTF.cpp:426-429
+  if (length > sourceEnd - source) {
+    return 0;
+  }
+  return isLegalUTF8(source, length) ? length : 0;
----------------



Repository:
  rG LLVM Github Monorepo

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D128059/new/

https://reviews.llvm.org/D128059



More information about the llvm-commits mailing list