[PATCH] D74731: [Clangd] Fixed assertion when processing extended ASCII characters.

Fri Feb 21 05:56:40 PST 2020

sammccall added a comment.

> Proposing a fix to extend the ascii handling code to take extended ascii as well.

Please reduce the scope of this patch to not crashing on invalid utf-8.

Handling other encodings is in principle possible, but requires more work and a more precise understanding of what we're trying to do.
(In particular, "extended ascii" doesn't have a precise meaning)

================
Comment at: clang-tools-extra/clangd/SourceCode.cpp:74
+    if(UTF8Length == 1 || UTF8Length > 4) {
+      vlog("File contains invalid UTF-8, or transcoding bug? UTF8Length = {0}, string = {1}", UTF8Length, U8);
+      // If UTF8 length is 1, treat as one UTF8, if above 4, treat as one UTF16
----------------
kadircet wrote:
> instead of printing the string, it would be better to print hex representation of current character, you can check `llvm::utohexstr`
Printing UTF8length isn't useful here., and I don't think current character is enough as it may not be where the error is (e.g. in the 10xxx case).

I'd suggest llvm::toHex(U8) and printing the offset I. Printing the bytes is better than the text as we know we're misinterpreting the bytes!

(In practice, these strings are parts of a single line, so it shouldn't be too long)

================
Comment at: clang-tools-extra/clangd/SourceCode.cpp:74
+    if(UTF8Length == 1 || UTF8Length > 4) {
+      vlog("File contains invalid UTF-8, or transcoding bug? UTF8Length = {0}, string = {1}", UTF8Length, U8);
+      // If UTF8 length is 1, treat as one UTF8, if above 4, treat as one UTF16
----------------
sammccall wrote:
> kadircet wrote:
> > instead of printing the string, it would be better to print hex representation of current character, you can check `llvm::utohexstr`
> Printing UTF8length isn't useful here., and I don't think current character is enough as it may not be where the error is (e.g. in the 10xxx case).
> 
> I'd suggest llvm::toHex(U8) and printing the offset I. Printing the bytes is better than the text as we know we're misinterpreting the bytes!
> 
> (In practice, these strings are parts of a single line, so it shouldn't be too long)
also I think you want to set a flag and vlog only the first time around the loop.

================
Comment at: clang-tools-extra/clangd/SourceCode.cpp:75
+      vlog("File contains invalid UTF-8, or transcoding bug? UTF8Length = {0}, string = {1}", UTF8Length, U8);
+      // If UTF8 length is 1, treat as one UTF8, if above 4, treat as one UTF16
+      if (CB(UTF8Length, UTF8Length == 1 ? 1 : 2))
----------------
kadircet wrote:
> we are already in a bad shape here, instead of making it more fluctuating can we always `CB(1, 1)`?
> 
> As long as you don't have any special reasoning for it.
Yeah, this isn't a principled recovery strategy, so CB(1, 1) is the way to go I think. 
With a comment indicating that we're not returning anything  really useful, like "// Treat this byte as a character, for lack of better options".
(And CB(2, 1) seems silly if you think this text is actually ISO-8859-x - that's a single-byte encoding!)

If we really wanted to do something principled, we could validate the UTF-8 first (llvm::isUTF8 in JSON.h, though it's missing some cases), and if it fails, invoke CB(1, 1) for each character under the assumption that it's ISO-8859-x so one-byte per char and all chars are in BMP. But I'd urge against it, I don't think this will actually end up working reliably end-to-end.

================
Comment at: clang-tools-extra/clangd/unittests/SourceCodeTests.cpp:60
   EXPECT_EQ(lspLength("😂"), 2UL);
+  // Extended ASCII
+  const char extendedASCIIStr[2] = {static_cast<char>(160U), '\0'};
----------------
"Extended ASCII" is vague and misleading. Please call these something like "Invalid UTF-8" or "ISO-8859-x misinterpreted as UTF-8" or so

================
Comment at: clang-tools-extra/clangd/unittests/SourceCodeTests.cpp:62
+  const char extendedASCIIStr[2] = {static_cast<char>(160U), '\0'};
+  EXPECT_EQ(lspLength(extendedASCIIStr), 1UL);
+  // Invalid UTF16
----------------
I don't think we should be asserting the exact values here, it's garbage and depends on implementation details.

The most important thing is to call it and verify it doesn't crash.
We could assert that 0 <= lspLength <= strlen, which lspLength generally obeys (for all encodings)

================
Comment at: clang-tools-extra/clangd/unittests/SourceCodeTests.cpp:64
+  // Invalid UTF16
+  const char invalidUTF16[6] = {
+      static_cast<char>(0xFF), static_cast<char>(0xFF), static_cast<char>(0xFF),
----------------
kadircet wrote:
>  I believe in addition to being invalidutf16, this is also extended ascii (representing multiple `ÿ`s).
> 
> if we want clangd to work for extended ascii. I suppose the length should rather be 5 instead, as in utf-8 case. Since all of these bytes are representing a different character.
I'm just really confused by this test case - we never interpret or convert it to UTF-16 so why is it called `invalidUTF16`? what is it supposed to be testing?

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D74731/new/

https://reviews.llvm.org/D74731