[PATCH] D112913: Misleading bidirectional detection

Wed Nov 24 04:08:44 PST 2021

upsuper added a comment.

I'm not familiar with LLVM / Clang codebase. I was asked by @MaskRay to help review the bidi-related part of this patch.

Generally I believe the algorithm here is too naive that it produces both false positives and false negatives. I'm not sure how important those edge cases are considered, but anyway... details comments below:

================
Comment at: clang-tools-extra/clang-tidy/misc/MisleadingBidirectional.cpp:48-49
+      // Line break: https://www.unicode.org/reports/tr14/tr14-32.html
+      if (C == '\n' || C == '\r' || C == '\f' || C == '\v' ||
+          C == 0x85 /*next line*/)
+        EmbeddingOverride = Isolate = 0;
----------------
UAX 14 is probably not the right document to look here.

According to UAX 9 step [[ https://www.unicode.org/reports/tr9/#L1 | L1 ]], the embedding level is reset to the paragraph embedding level when hitting segment separator and paragraph separator, which are defined in [[ https://www.unicode.org/reports/tr9/#Table_Bidirectional_Character_Types | table 4 ]] as type B and S, and in UCD [[ https://www.unicode.org/Public/UCD/latest/ucd/extracted/DerivedBidiClass.txt | DerivedBidiClass.txt ]] you can see type B includes U+000A, U+000D, U+001C..001E, U+0085, U+2029, and type S includes U+0009, U+000B, and U+001F.

You have U+000C and U+2028 here which are counted as type WS which doesn't affect embedding level, so you may be resetting the counter prematurely. And you may want to reset the counter for all other characters above.

================
Comment at: clang-tools-extra/clang-tidy/misc/MisleadingBidirectional.cpp:64-72
+    if (CodePoint == RLO || CodePoint == RLE || CodePoint == LRO ||
+        CodePoint == LRE)
+      EmbeddingOverride += 1;
+    else if (CodePoint == PDF)
+      EmbeddingOverride = std::min(EmbeddingOverride - 1, EmbeddingOverride);
+    else if (CodePoint == RLI || CodePoint == LRI || CodePoint == FSI)
+      Isolate += 1;
----------------
The bidi algorithm is more complicated than having two simple counters. Basically, a `PDF` cancels a override / embedding character only when they match, and `PDI` cancels all override / embedding between it and its matching isolate character.

I was thinking whether the counters would be enough in the sense that we may accept some false positive for edge cases but definitely no false negative. But I'm convinced it's not the case. As an example, a sequence of `RLO LRI PDF PDI` will yield no embedding and no isolate in this counting model, but the `PDF` here actually has no effect, as shown in step [[ https://www.unicode.org/reports/tr9/#X7 | X7 ]], if it does not match an embedding initiator, it is ignored, so this is effectively leaving a dangling `RLO`.

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D112913/new/

https://reviews.llvm.org/D112913