[PATCH] D45753: Lift JSON library from clang-tools-extra/clangd to llvm/Support.

Mon Apr 23 10:00:00 PDT 2018

sammccall marked 2 inline comments as done.
sammccall added a comment.

In https://reviews.llvm.org/D45753#1075026, @labath wrote:

> This looks good to me, but I do feel someone else should comment on the appropriateness of including this library in llvm/Support. Chandler is listed as the code owner of Support, and he used to have opinions on json parsers in the past, so maybe he would be a good candidate (?)

Good suggestion, I'll try to get this on his radar. Thanks!

================
Comment at: include/llvm/Support/JSON.h:457
+  // Maps a property to a field, if it exists.
+  template <typename T> bool map(const char *Prop, T &Out) {
+    assert(*this && "Must check this is an object before calling map()");
----------------
labath wrote:
> Could we use `ObjectKey` as the property type here?
Better, StringRef. (ObjectKey is just a maybe-owning stringref, and non-owning is fine here)

================
Comment at: lib/Support/JSON.cpp:283-364
+void Parser::encodeUtf8(uint32_t Rune, std::string &Out) {
+  if (Rune <= 0x7F) {
+    Out.push_back(Rune & 0x7F);
+  } else if (Rune <= 0x7FF) {
+    uint8_t FirstByte = 0xC0 | ((Rune & 0x7C0) >> 6);
+    uint8_t SecondByte = 0x80 | (Rune & 0x3F);
+    Out.push_back(FirstByte);
----------------
sammccall wrote:
> labath wrote:
> > Is there any reason `Support/ConvertUTF.h` cannot be used here? Is sounds like you just need the "lenient" conversion mode here.
> Done. That's really slow code with a really inconvenient API, but it shouldn't matter.
Hmm, I guess I only *thought* I ran the tests :-(

It turns out `DecodeUTF16ToUTF8` doesn't do the right thing - at least it doesn't do anything particularly compatible with JSON. The problematic cases are when UTF-16 surrogate code units appear without being properly paired.
 - a JSON parser has to accept these, as they conform to the grammar
 - the best behavior per Unicode is to replace them with U+FFFD
 - ConvertUTF in lenient mode simply drops them in most cases (permitted, but not recommended). When encountering a lone leading surrogate at the end of text, it returns an error even in lenient mode.

References:
http://seriot.ch/parsing_json.php
https://www.rfc-editor.org/errata_search.php?rfc=7159&eid=3984
http://unicode.org/review/pr-121.html

I've restored the previous code and added more comments, including reasons not to use ConvertUTF.  This implements unicode's preferred handling of invalid UTF-16 ("Replace each maximal subpart of the ill-formed subsequence by a single U+FFFD").

Repository:
  rL LLVM

https://reviews.llvm.org/D45753