[PATCH] D46035: [clangd] Fix unicode handling, using UTF-16 where LSP requires it.

Sam McCall via Phabricator via llvm-commits llvm-commits at lists.llvm.org
Mon Apr 30 01:42:14 PDT 2018


sammccall added inline comments.


================
Comment at: clang-tools-extra/trunk/clangd/SourceCode.cpp:38
+    // 0xxx is ASCII, handled above. 10xxx is a trailing byte, invalid here.
+    // 11111xxx is not valid UTF-8 at all. Assert because it's probably our bug.
+    assert((UTF8Length >= 2 && UTF8Length <= 4) &&
----------------
benhamilton wrote:
> This is user input, right? Have we actually checked for valid UTF-8, or do we just assume it's valid?
> 
> If not, it seems like an assertion is not the right check, but we should reject it when we're reading the input.
> 
Yeah, I wasn't sure about this, offline discussion tentatively concluded we wanted an assert, but I'm happy to switch to something else.

We don't validate the code on the way in, so strings are "bytes of presumed-UTF8". This is usually not a big pain actually. But we could/should certainly make the JSON parser validate the UTF-8. (If we want to go this route, D45753 should be resolved first).

There's two ways the assertion could fire: the code is invalid UTF-8, or there's a bug in the unicode logic here. I thought the latter was more likely at least in the short-term :) and this is the least invasive way to catch it. And if a developer build (assert-enabled) crashes because an editor feeds it invalid bytes, then that's probably better than doing nothing (though not as good as catching the error earlier).


Repository:
  rL LLVM

https://reviews.llvm.org/D46035





More information about the llvm-commits mailing list