[PATCH] D46035: [clangd] Fix unicode handling, using UTF-16 where LSP requires it.

Ben Hamilton via Phabricator via llvm-commits llvm-commits at lists.llvm.org
Mon Apr 30 08:57:56 PDT 2018


benhamilton added inline comments.


================
Comment at: clang-tools-extra/trunk/clangd/SourceCode.cpp:38
+    // 0xxx is ASCII, handled above. 10xxx is a trailing byte, invalid here.
+    // 11111xxx is not valid UTF-8 at all. Assert because it's probably our bug.
+    assert((UTF8Length >= 2 && UTF8Length <= 4) &&
----------------
sammccall wrote:
> benhamilton wrote:
> > This is user input, right? Have we actually checked for valid UTF-8, or do we just assume it's valid?
> > 
> > If not, it seems like an assertion is not the right check, but we should reject it when we're reading the input.
> > 
> Yeah, I wasn't sure about this, offline discussion tentatively concluded we wanted an assert, but I'm happy to switch to something else.
> 
> We don't validate the code on the way in, so strings are "bytes of presumed-UTF8". This is usually not a big pain actually. But we could/should certainly make the JSON parser validate the UTF-8. (If we want to go this route, D45753 should be resolved first).
> 
> There's two ways the assertion could fire: the code is invalid UTF-8, or there's a bug in the unicode logic here. I thought the latter was more likely at least in the short-term :) and this is the least invasive way to catch it. And if a developer build (assert-enabled) crashes because an editor feeds it invalid bytes, then that's probably better than doing nothing (though not as good as catching the error earlier).
The problem with not validating is it's easy to cause OOB memory access (and thus security issues) if someone crafts malicious UTF-8 and makes us read off the end of a string.

We should be clear about the status of all strings in the documentation to APIs.


Repository:
  rL LLVM

https://reviews.llvm.org/D46035





More information about the llvm-commits mailing list