[PATCH] D112916: [clang-tidy] Confusable identifiers detection

Whisperity via Phabricator via cfe-commits cfe-commits at lists.llvm.org
Mon Jun 20 00:22:57 PDT 2022


whisperity added inline comments.


================
Comment at: clang-tools-extra/clang-tidy/misc/Homoglyph.cpp:32-35
+/**
+ * Build a skeleton out of the Original identifier, following the algorithm
+ * described in http://www.unicode.org/reports/tr39/#def-skeleton
+ */
----------------
`/*` -> `//` or `///` (latter is for documentation comments)


================
Comment at: clang-tools-extra/clang-tidy/misc/Homoglyph.cpp:41-42
+
+  char const *Curr = SName.c_str();
+  char const *End = Curr + SName.size();
+  while (Curr < End) {
----------------
(Nit: in LLVM we use //"const west"//)


================
Comment at: clang-tools-extra/clang-tidy/misc/Homoglyph.cpp:48
+    llvm::ConversionResult Result = llvm::convertUTF8Sequence(
+        (const llvm::UTF8 **)&Curr, (const llvm::UTF8 *)End, &CodePoint,
+        llvm::strictConversion);
----------------
These casts are weird when reading the code. The problem with C-style casts is that the compiler will try to resolve it to //some// kind of cast, which makes them less "explicit" than would be good for type safety. Could you replace these with `static_cast`s? Or are these, in fact, `reinterpret_cast`s?


================
Comment at: clang-tools-extra/clang-tidy/misc/Homoglyph.cpp:64-68
+      llvm::UTF8 Buffer[32];
+      llvm::UTF8 *BufferStart = std::begin(Buffer);
+      llvm::UTF8 *IBuffer = BufferStart;
+      const llvm::UTF32 *ValuesStart = std::begin(Where->values);
+      const llvm::UTF32 *ValuesEnd =
----------------
(Nit: consider adding a `using namespace llvm;` **inside** the function and then you could reduce the length of these types.)


================
Comment at: clang-tools-extra/clang-tidy/misc/Homoglyph.cpp:84
+  if (const auto *ND = Result.Nodes.getNodeAs<NamedDecl>("nameddecl")) {
+    if(  IdentifierInfo *   II = ND->getIdentifier()) {
+      StringRef NDName = II->getName();
----------------
(Nit: formatting is broken here.)


================
Comment at: clang-tools-extra/clang-tidy/misc/Homoglyph.cpp:86-88
+      auto &Mapped = Mapper[skeleton(NDName)];
+      auto *NDDecl = ND->getDeclContext();
+      for (auto *OND : Mapped) {
----------------
(Nit: we tend to use `auto` if and only if the type of the variable is either reasonably impossible to spell out (e.g. an iterator into a container with 3784723 parameters) or immediately obvious from the initialiser expression (e.g. the init is a cast).)


================
Comment at: clang-tools-extra/clang-tidy/misc/Homoglyph.cpp:93
+        if (OND->getName() != NDName) {
+          diag(OND->getLocation(), "%0 is confusable with %1")
+              << OND->getName() << NDName;
----------------
Perhaps you could elaborate on "confusable" here a bit for the user?


================
Comment at: clang-tools-extra/clang-tidy/misc/Homoglyph.cpp:94
+          diag(OND->getLocation(), "%0 is confusable with %1")
+              << OND->getName() << NDName;
+          diag(ND->getLocation(), "other definition found here",
----------------
Careful: `getName()` might assert away if the identifier isn't "trivially nameable", e.g. `operator==`. Is this prevented somewhere? (Otherwise, could we add a negative test case just to ensure no crashes start happening?) Operators are also `NamedDecl`s.


================
Comment at: clang-tools-extra/clang-tidy/misc/Homoglyph.cpp:95
+              << OND->getName() << NDName;
+          diag(ND->getLocation(), "other definition found here",
+               DiagnosticIDs::Note);
----------------
Definition or declaration? "Entity"?


================
Comment at: clang-tools-extra/clang-tidy/misc/Homoglyph.h:18
+
+class Homoglyph : public ClangTidyCheck {
+public:
----------------
This is the "top-level" check that is the addition in this patch, right?

In that case, this class is breaking naming conventions (usually the symbol name ends with `Check`) and the "usual" documentation comment from the class is also missing.

-----

Also, maybe a full rename to `ConfusableIdentifierCheck` would be worth it. If you'd like to stick to the current summary of the patch.


================
Comment at: clang-tools-extra/clang-tidy/misc/MiscTidyModule.cpp:41
         "misc-misleading-bidirectional");
+    CheckFactories.registerCheck<Homoglyph>("misc-homoglyph");
     CheckFactories.registerCheck<MisleadingIdentifierCheck>(
----------------
The comment about coming up with a better name for the check I think applies here too. I would be comfortable with `misc-confusable-identifiers`. The problem with `homoglyph` is that it requires a specific understanding of English and (natural) language studies that we should not hard-require from our users. If I have to look up on Wikipedia what the name of the rule I'm using means then something is wrong.


================
Comment at: clang-tools-extra/docs/ReleaseNotes.rst:141
+
+  Detects confusable unicode identifiers.
+
----------------
**U**nicode? Isn't it a trademark?


================
Comment at: clang-tools-extra/docs/clang-tidy/checks/misc-homoglyph.rst:8
+each other, but use different Unicode characters. This detects a potential
+attack described in `Trojan Source <https://www.trojansource.codes>`_.
+
----------------
Is the associated research paper **published** anywhere? I can only come up with arXiv preprints. However, adding a link to arXiv would be more "future proof" than the (agreeably fancy) website itself. The site mentions one and the research paper mentions two CVE numbers for the attack vector... perhaps we should just link the CVE database.


================
Comment at: clang-tools-extra/test/clang-tidy/checkers/misc-homoglyph.cpp:16-19
+int 𝐟i;
+// CHECK-MESSAGES: :[[#@LINE-1]]:5: warning: 𝐟i is confusable with fi [misc-homoglyph]
+int fi;
+// CHECK-MESSAGES: :[[#@LINE-1]]:5: note: other definition found here
----------------
(Nit: If we keep the message blocks together they will be more resilient to accidental modifications. You can have any kind of offset in the test comment, anyway.)


CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D112916/new/

https://reviews.llvm.org/D112916



More information about the cfe-commits mailing list