[PATCH] D128059: [Clang] Add a warning on invalid UTF-8 in comments.

Tue Jun 21 23:49:40 PDT 2022

cor3ntin added inline comments.

================
Comment at: clang/lib/Lex/Lexer.cpp:2399
+                                      getSourceLocation(CurPtr));
+  bool UnicodeDecodeFailed = false;
+
----------------
aaron.ballman wrote:
> It looks like this can move into the `while` loop and we can remove some `= false` from within the loop?
The idea here is to not diagnose a contiguous sequence of invalid code units more than once
The sequence  `0x80x0x80x80` for example will only warn one for example. The goal is to avoid noise when, for example there is a long comment in Windows-1251 encoded cyrillic. it would warn one per word (because the spaces would decode fine) rather than one per contiguous non-sense character.
It's also somewhat necessary because we can't know where the end of the invalid "character" is, better to be greedy. 
(The test file check that behavior)

================
Comment at: clang/lib/Lex/Lexer.cpp:2407
+    // diagnostic only once per sequence that cannot be decoded.
+    while ((!WarnOnInvalidUtf8 || isASCII(C)) && C != 0 && // Potentially EOF.
+           C != '\n' && C != '\r') { // Newline or DOS-style newline.
----------------
aaron.ballman wrote:
> Do you have some idea of performance cost for enabling the warning with a large TU? Are we talking .5%? 1%? 5%? noise?
This adds 2 comparisons when the warning is enbled (is ASCII(c) is simply `C < 0x80`) - so it should be noise, but I have not run benchmark.
It might be more noticeable within multi line comments as there the optimization that vectorizes the comment skipping on SSE2 platforms is simply skipped when this optimization is enabled.

================
Comment at: clang/lib/Lex/Lexer.cpp:2417
+      if (Length == 0) {
+        if (!UnicodeDecodeFailed) {
+          Diag(CurPtr, diag::warn_invalid_utf8_in_comment);
----------------
aaron.ballman wrote:
> aaron.ballman wrote:
> > 
> So if unicode decoding didn't fail... diagnose? :-) The naming here seems unfortunate.
`if(!UnicodeDecodeFailedInThePreviousLoopIteration)` ? I'm open to suggestion here

Repository:
  rG LLVM Github Monorepo

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D128059/new/

https://reviews.llvm.org/D128059