[PATCH] D74731: [Clangd] Fixed assertion when processing extended ASCII characters.

Fri Feb 21 01:44:26 PST 2020

kadircet added inline comments.

================
Comment at: clang-tools-extra/clangd/SourceCode.cpp:70
     size_t UTF8Length = llvm::countLeadingOnes(C);
-    // 0xxx is ASCII, handled above. 10xxx is a trailing byte, invalid here.
-    // 11111xxx is not valid UTF-8 at all. Assert because it's probably our bug.
-    assert((UTF8Length >= 2 && UTF8Length <= 4) &&
-           "Invalid UTF-8, or transcoding bug?");
     I += UTF8Length; // Skip over all trailing bytes.
+    // 0xxx is ASCII, handled above. 10xxx is a trailing byte, invalid here.
----------------
can we also move this into valid case, and only increment by one on the invalid(extended ascii) case ?

as we are trying to recover from extended ascii, and they are 1 byte in length, not multiple.

================
Comment at: clang-tools-extra/clangd/SourceCode.cpp:71
     I += UTF8Length; // Skip over all trailing bytes.
-    // A codepoint takes two UTF-16 code unit if it's astral (outside BMP).
-    // Astral codepoints are encoded as 4 bytes in UTF-8 (11110xxx ...)
-    if (CB(UTF8Length, UTF8Length == 4 ? 2 : 1))
-      return true;
+    // 0xxx is ASCII, handled above. 10xxx is a trailing byte, invalid here.
+    // 11111xxx is not valid UTF-8 at all. Log it because it's probably our bug.
----------------
could you re-arrange this part into

```
if (LLVM_LIKELY(UTF8Length >= 2 && UTF8Length <=4)) {
      // A codepoint takes two UTF-16 code unit if it's astral (outside BMP).
      // Astral codepoints are encoded as 4 bytes in UTF-8 (11110xxx ...)
      if (CB(UTF8Length, UTF8Length >> 1))
        return true;
      continue;
}

vlog("File contains invalid UTF-8, or transcoding bug? UTF8Length = {0}, string = {1}", UTF8Length, U8);
// If UTF8 length is 1, treat as one UTF8, if above 4, treat as one UTF16
if (CB(UTF8Length, UTF8Length == 1 ? 1 : 2))
   return true;
```

================
Comment at: clang-tools-extra/clangd/SourceCode.cpp:74
+    if(UTF8Length == 1 || UTF8Length > 4) {
+      vlog("File contains invalid UTF-8, or transcoding bug? UTF8Length = {0}, string = {1}", UTF8Length, U8);
+      // If UTF8 length is 1, treat as one UTF8, if above 4, treat as one UTF16
----------------
instead of printing the string, it would be better to print hex representation of current character, you can check `llvm::utohexstr`

================
Comment at: clang-tools-extra/clangd/SourceCode.cpp:75
+      vlog("File contains invalid UTF-8, or transcoding bug? UTF8Length = {0}, string = {1}", UTF8Length, U8);
+      // If UTF8 length is 1, treat as one UTF8, if above 4, treat as one UTF16
+      if (CB(UTF8Length, UTF8Length == 1 ? 1 : 2))
----------------
we are already in a bad shape here, instead of making it more fluctuating can we always `CB(1, 1)`?

As long as you don't have any special reasoning for it.

================
Comment at: clang-tools-extra/clangd/unittests/SourceCodeTests.cpp:64
+  // Invalid UTF16
+  const char invalidUTF16[6] = {
+      static_cast<char>(0xFF), static_cast<char>(0xFF), static_cast<char>(0xFF),
----------------
 I believe in addition to being invalidutf16, this is also extended ascii (representing multiple `ÿ`s).

if we want clangd to work for extended ascii. I suppose the length should rather be 5 instead, as in utf-8 case. Since all of these bytes are representing a different character.

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D74731/new/

https://reviews.llvm.org/D74731