[PATCH] D112916: [clang-tidy] Confusable identifiers detection

Mon Jun 20 01:05:43 PDT 2022

cor3ntin added a comment.

So that this doesn't get lost, I'm opposed to this change because it does not actually follow the TR39 guidance or algorithm.

  For an input string X, define skeleton(X) to be the following transformation on the string:

  1. Convert X to NFD format, as described in [UAX15].
  2. Concatenate the prototypes for each character in X according to the specified data, producing a string of exemplar characters.
  3. Reapply NFD.

Step 1 and 3 not being performed will lead to false negative.

That being said, given that it is for clang tidy and that adding decomposition to llvm is a bit involved (I could look into it if there is interest), this is not a strong opposition, but i wanted to make sure people are aware of the limitation.
I think we should have a fixme in the code so that we don't forget.

================
Comment at: clang-tools-extra/clang-tidy/misc/ConfusableTable/build_confusable_table.cpp:59
+  llvm::raw_fd_ostream os(argv[2], ec);
+  os << "struct {llvm::UTF32 codepoint; llvm::UTF32 values[32];} "
+        "ConfusableEntries[] = {\n";
----------------
This is terribly menory inneficient - most confusables are 1-3 code point longs.
Consider
 *  Replacing the hard coded 32 by the actual maximum length that you can extract when buikding the table
  * Having a table for the common case of 1-3 codepoints for the common case and another one for the long sequence. 

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D112916/new/

https://reviews.llvm.org/D112916