[PATCH] D128059: [Clang] Add a warning on invalid UTF-8 in comments.

Wed Jun 22 04:36:40 PDT 2022

aaron.ballman added inline comments.

================
Comment at: clang/lib/Lex/Lexer.cpp:2702-2703
     // (probably ending) '/' character.
-    if (CurPtr + 24 < BufferEnd &&
+    // When diagnosing invalid UTF-8 sequences we always skip the fast
+    // vectorized path.
+    if (!WarnOnInvalidUtf8 && CurPtr + 24 < BufferEnd &&
----------------
However, *this* seems like it could noticeably impact performance, especially for C code (including C header files in C++ mode), and may be worth getting measurements for.

================
Comment at: clang/lib/Lex/Lexer.cpp:2399
+                                      getSourceLocation(CurPtr));
+  bool UnicodeDecodeFailed = false;
+
----------------
cor3ntin wrote:
> aaron.ballman wrote:
> > It looks like this can move into the `while` loop and we can remove some `= false` from within the loop?
> The idea here is to not diagnose a contiguous sequence of invalid code units more than once
> The sequence  `0x80x0x80x80` for example will only warn one for example. The goal is to avoid noise when, for example there is a long comment in Windows-1251 encoded cyrillic. it would warn one per word (because the spaces would decode fine) rather than one per contiguous non-sense character.
> It's also somewhat necessary because we can't know where the end of the invalid "character" is, better to be greedy. 
> (The test file check that behavior)
Oh derp, you're right (of course), sorry for the noise!

================
Comment at: clang/lib/Lex/Lexer.cpp:2407
+    // diagnostic only once per sequence that cannot be decoded.
+    while ((!WarnOnInvalidUtf8 || isASCII(C)) && C != 0 && // Potentially EOF.
+           C != '\n' && C != '\r') { // Newline or DOS-style newline.
----------------
cor3ntin wrote:
> aaron.ballman wrote:
> > Do you have some idea of performance cost for enabling the warning with a large TU? Are we talking .5%? 1%? 5%? noise?
> This adds 2 comparisons when the warning is enbled (is ASCII(c) is simply `C < 0x80`) - so it should be noise, but I have not run benchmark.
> It might be more noticeable within multi line comments as there the optimization that vectorizes the comment skipping on SSE2 platforms is simply skipped when this optimization is enabled.
Ah, no need to run the benchmark then, I was thinking this would involve function call overhead and more complex checking for the `isASCII` call than exists in reality.

================
Comment at: clang/lib/Lex/Lexer.cpp:2417
+      if (Length == 0) {
+        if (!UnicodeDecodeFailed) {
+          Diag(CurPtr, diag::warn_invalid_utf8_in_comment);
----------------
cor3ntin wrote:
> aaron.ballman wrote:
> > aaron.ballman wrote:
> > > 
> > So if unicode decoding didn't fail... diagnose? :-) The naming here seems unfortunate.
> `if(!UnicodeDecodeFailedInThePreviousLoopIteration)` ? I'm open to suggestion here
`UnicodeDecodingAlreadyDiagnosed`

Repository:
  rG LLVM Github Monorepo

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D128059/new/

https://reviews.llvm.org/D128059