[PATCH] D74731: [Clangd] Fixed assertion when processing extended ASCII characters.
Kadir Cetinkaya via Phabricator via cfe-commits
cfe-commits at lists.llvm.org
Fri Feb 21 01:44:26 PST 2020
kadircet added inline comments.
================
Comment at: clang-tools-extra/clangd/SourceCode.cpp:70
size_t UTF8Length = llvm::countLeadingOnes(C);
- // 0xxx is ASCII, handled above. 10xxx is a trailing byte, invalid here.
- // 11111xxx is not valid UTF-8 at all. Assert because it's probably our bug.
- assert((UTF8Length >= 2 && UTF8Length <= 4) &&
- "Invalid UTF-8, or transcoding bug?");
I += UTF8Length; // Skip over all trailing bytes.
+ // 0xxx is ASCII, handled above. 10xxx is a trailing byte, invalid here.
----------------
can we also move this into valid case, and only increment by one on the invalid(extended ascii) case ?
as we are trying to recover from extended ascii, and they are 1 byte in length, not multiple.
================
Comment at: clang-tools-extra/clangd/SourceCode.cpp:71
I += UTF8Length; // Skip over all trailing bytes.
- // A codepoint takes two UTF-16 code unit if it's astral (outside BMP).
- // Astral codepoints are encoded as 4 bytes in UTF-8 (11110xxx ...)
- if (CB(UTF8Length, UTF8Length == 4 ? 2 : 1))
- return true;
+ // 0xxx is ASCII, handled above. 10xxx is a trailing byte, invalid here.
+ // 11111xxx is not valid UTF-8 at all. Log it because it's probably our bug.
----------------
could you re-arrange this part into
```
if (LLVM_LIKELY(UTF8Length >= 2 && UTF8Length <=4)) {
// A codepoint takes two UTF-16 code unit if it's astral (outside BMP).
// Astral codepoints are encoded as 4 bytes in UTF-8 (11110xxx ...)
if (CB(UTF8Length, UTF8Length >> 1))
return true;
continue;
}
vlog("File contains invalid UTF-8, or transcoding bug? UTF8Length = {0}, string = {1}", UTF8Length, U8);
// If UTF8 length is 1, treat as one UTF8, if above 4, treat as one UTF16
if (CB(UTF8Length, UTF8Length == 1 ? 1 : 2))
return true;
```
================
Comment at: clang-tools-extra/clangd/SourceCode.cpp:74
+ if(UTF8Length == 1 || UTF8Length > 4) {
+ vlog("File contains invalid UTF-8, or transcoding bug? UTF8Length = {0}, string = {1}", UTF8Length, U8);
+ // If UTF8 length is 1, treat as one UTF8, if above 4, treat as one UTF16
----------------
instead of printing the string, it would be better to print hex representation of current character, you can check `llvm::utohexstr`
================
Comment at: clang-tools-extra/clangd/SourceCode.cpp:75
+ vlog("File contains invalid UTF-8, or transcoding bug? UTF8Length = {0}, string = {1}", UTF8Length, U8);
+ // If UTF8 length is 1, treat as one UTF8, if above 4, treat as one UTF16
+ if (CB(UTF8Length, UTF8Length == 1 ? 1 : 2))
----------------
we are already in a bad shape here, instead of making it more fluctuating can we always `CB(1, 1)`?
As long as you don't have any special reasoning for it.
================
Comment at: clang-tools-extra/clangd/unittests/SourceCodeTests.cpp:64
+ // Invalid UTF16
+ const char invalidUTF16[6] = {
+ static_cast<char>(0xFF), static_cast<char>(0xFF), static_cast<char>(0xFF),
----------------
I believe in addition to being invalidutf16, this is also extended ascii (representing multiple `ΓΏ`s).
if we want clangd to work for extended ascii. I suppose the length should rather be 5 instead, as in utf-8 case. Since all of these bytes are representing a different character.
CHANGES SINCE LAST ACTION
https://reviews.llvm.org/D74731/new/
https://reviews.llvm.org/D74731
More information about the cfe-commits
mailing list