[PATCH] D112913: Misleading bidirectional detection

Wed Nov 3 13:39:10 PDT 2021

serge-sans-paille added inline comments.

================
Comment at: clang-tools-extra/clang-tidy/misc/MisleadingBidirectional.cpp:59
+    // If conversion fails, utf-8 is designed so that we can just try next char.
+    if (Result != llvm::conversionOK) {
+      ++CurPtr;
----------------
rsmith wrote:
> Is there a guarantee that `convertUTF8Sequence` doesn't update `CurPtr` on error? I'm concerned we might increment *past* the end in the case where `CurPtr` points to the end, below, which would at least formally be UB.
According to the doc "If the conversion succeeds, this pointer will be updated to point to the byte just past the end of the converted sequence". My understanding of the implementation confirms that statements, and it's used in a similar manner in clang Lexer.

================
Comment at: clang-tools-extra/clang-tidy/misc/MisleadingBidirectional.cpp:60-62
+    // If conversion fails, we stop the analysis because we don't know how many
+    // characters should be skipped otherwise. That's an obvious way to bypass
+    // the check.
----------------
rsmith wrote:
> Can we scan forward looking for the next non-continuation byte? (Skip while `c & 0b1100_0000 == 0b1000_0000`)
I'm not quite sure. There's a risk we end up starting over from the second part of a unicode character, and that could mess up with the decoding. I'll investigate how it's done in other libraries.

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D112913/new/

https://reviews.llvm.org/D112913