[PATCH] D112914: Misleading identifier detection

Mon Nov 1 15:44:52 PDT 2021

rsmith accepted this revision.
rsmith added a comment.
This revision is now accepted and ready to land.

Looks good as far as it goes (though I've not checked your classification functions are correct).

We should have some detection for RTL characters in string literals and comments too, at least when there is no subsequent character with strong LTR directionality. Neither `"` nor `*/` has strong directionality, so both `א"` and `א*/` can potentially leave us in an RTL state. For example:

  // This is "א" then + then "ג".
  "א" + "ג"

  // This is `a` then `/*א*/` then `*` then `-` then `/*ג*/` then `b`, aka "a * - b" not "a - * b".
  a /*א*/ * - /*ג*/ b

================
Comment at: clang-tools-extra/clang-tidy/misc/MisleadingIdentifier.cpp:2
+//===--- MisleadingIdentifier.cpp - clang-tidy
+//--------------------------------===//
+//
----------------
Please fix

================
Comment at: clang-tools-extra/clang-tidy/misc/MisleadingIdentifier.cpp:19
+
+// see https://www.unicode.org/Public/14.0.0/ucd/extracted/DerivedBidiClass.txt
+static bool isUnassignedAL(llvm::UTF32 CP) {
----------------
(throughout).

================
Comment at: clang-tools-extra/clang-tidy/misc/MisleadingIdentifier.cpp:134-135
+        llvm::strictConversion);
+    if (Result != llvm::conversionOK)
+      break;
+    if (isUnassignedAL(CodePoint) || isUnassignedR(CodePoint) || isR(CodePoint))
----------------
As in the other patch, we could recover here by skipping until we find a valid UTF-8 lead byte. It's not important here since invalid UTF-8 won't be accepted in an identifier anyway, but might be important if we start checking strings and comments for RTL characters too.

Repository:
  rG LLVM Github Monorepo

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D112914/new/

https://reviews.llvm.org/D112914