[PATCH] D46274: [Support] Harden JSON against invalid UTF-8.

Ben Hamilton via Phabricator via llvm-commits llvm-commits at lists.llvm.org
Mon Apr 30 10:50:53 PDT 2018


benhamilton requested changes to this revision.
benhamilton added a comment.
This revision now requires changes to proceed.

Looks good, just missed a few edge cases for 4-byte sequences.

Also need to reject the CESU-8 encoding where UTF-16 surrogate pairs are represented as two 3-byte UTF-8 sequences.



================
Comment at: lib/Support/JSON.cpp:537
+  case 2: // 110xxxxx 10xxxxxx.
+    // U+80 = C2 80 is the first two-byte character.
+    if (C < 0xC2 || !EatTrailing())
----------------
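Can we clarify in the comment that this check also enforces the shortest possible encoding? Lead bytes `0xC0` and `0xC1` could only encode code points below `U+80`, which must use the one-byte form.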
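
To illustrate the point, here is a minimal standalone sketch (not the code in this patch) of why comparing the lead byte against `0xC2` amounts to a shortest-form check. The names `isValidTwoByte` and `decodeTwoByte` are hypothetical and exist only for this example.

```cpp
#include <cstdint>

// Hypothetical helper: decodes a two-byte sequence WITHOUT validating it,
// so the overlong cases can be inspected.
uint32_t decodeTwoByte(uint8_t Lead, uint8_t Trail) {
  return (uint32_t(Lead & 0x1F) << 6) | uint32_t(Trail & 0x3F);
}

// Hypothetical helper: true if (Lead, Trail) is a well-formed two-byte
// UTF-8 sequence.
bool isValidTwoByte(uint8_t Lead, uint8_t Trail) {
  // Lead must look like 110xxxxx and Trail like 10xxxxxx.
  if ((Lead & 0xE0) != 0xC0 || (Trail & 0xC0) != 0x80)
    return false;
  // C0 and C1 can only produce U+0000..U+007F, which fit in one byte,
  // so any sequence starting with them is overlong.
  return Lead >= 0xC2;
}
```

With these definitions, `C1 BF` decodes to `U+7F` (overlong, rejected), while `C2 80` decodes to `U+80` (accepted).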


================
Comment at: lib/Support/JSON.cpp:541
+    return 2;
+  case 3: // 1110xxxx 10xxxxxx 10xxxxxx.
+    // U+800 = E0 A0 80 is the first three-byte character.
----------------
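We also need to check for and reject the so-called CESU-8 encoding, where each half of a UTF-16 surrogate pair is "encoded" as its own 3-byte UTF-8 sequence (lead byte `0xED` followed by a second byte in `0xA0`..`0xBF`, i.e. `U+D800`..`U+DFFF`):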

https://www.unicode.org/reports/tr26/#definitions
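
As a rough sketch of what that check could look like (a standalone example, not the patch's `measureChar`; the name `isValidThreeByte` is hypothetical), the surrogate range can be excluded by looking at the second byte when the lead byte is `0xED`:

```cpp
#include <cstdint>

// Hypothetical helper: true if (B0, B1, B2) is a well-formed three-byte
// UTF-8 sequence.
bool isValidThreeByte(uint8_t B0, uint8_t B1, uint8_t B2) {
  // Shape check: 1110xxxx 10xxxxxx 10xxxxxx.
  if ((B0 & 0xF0) != 0xE0 || (B1 & 0xC0) != 0x80 || (B2 & 0xC0) != 0x80)
    return false;
  // E0 80..9F would encode a code point below U+0800: overlong.
  if (B0 == 0xE0 && B1 < 0xA0)
    return false;
  // ED A0..BF would encode U+D800..U+DFFF: UTF-16 surrogates, which is
  // what CESU-8 produces for characters outside the BMP.
  if (B0 == 0xED && B1 >= 0xA0)
    return false;
  return true;
}
```

For example, CESU-8 would write `U+10000` as `ED A0 80 ED B0 80`; both halves fail this check, whereas `ED 9F BF` (`U+D7FF`) and `EE 80 80` (`U+E000`) pass.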



================
Comment at: lib/Support/JSON.cpp:552
+    // U+10000 = F0 90 80 80 is the first three-byte character.
+    if (C == 0xF0) {
+      if (EatTrailing() < 0x90 || !EatTrailing() || !EatTrailing())
----------------
Also need to handle two more cases which would encode a code point > `U+10FFFF`, which is not allowed:

1) First byte `== 0xF4` and second byte `> 0x8F`
2) First byte `> 0xF4`
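
Putting the lower and upper bounds together, a minimal standalone sketch of the four-byte check might look like this (the name `isValidFourByte` is hypothetical, and this is not the patch's code):

```cpp
#include <cstdint>

// Hypothetical helper: true if (B0..B3) is a well-formed four-byte
// UTF-8 sequence encoding U+10000..U+10FFFF.
bool isValidFourByte(uint8_t B0, uint8_t B1, uint8_t B2, uint8_t B3) {
  // Shape check: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx.
  if ((B0 & 0xF8) != 0xF0)
    return false;
  if ((B1 & 0xC0) != 0x80 || (B2 & 0xC0) != 0x80 || (B3 & 0xC0) != 0x80)
    return false;
  // F0 80..8F would encode a code point below U+10000: overlong.
  if (B0 == 0xF0 && B1 < 0x90)
    return false;
  // Case 1: F4 90.. would encode U+110000 or above.
  if (B0 == 0xF4 && B1 > 0x8F)
    return false;
  // Case 2: F5..F7 always encode a code point above U+10FFFF.
  if (B0 > 0xF4)
    return false;
  return true;
}
```

The boundary cases are `F0 90 80 80` (`U+10000`, accepted), `F4 8F BF BF` (`U+10FFFF`, accepted), and `F4 90 80 80` (would be `U+110000`, rejected).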



================
Comment at: lib/Support/JSON.cpp:572
+  for (size_t I = 0; I < S.size();)
+    if (!measureChar(S, I)) {
+      if (ErrOffset)
----------------
```
if (!LLVM_LIKELY(measureChar(S, I))) {
```
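
For context, here is a self-contained sketch of the pattern. `SKETCH_LIKELY` is a hypothetical stand-in written for this example (it wraps GCC/Clang's `__builtin_expect` and falls back to the plain expression elsewhere), and `measureAscii` is a simplified stand-in for `measureChar` that accepts only ASCII and is assumed to advance the index on success.

```cpp
#include <cstddef>
#include <string>

// Hypothetical stand-in for a branch-prediction hint macro.
#if defined(__GNUC__) || defined(__clang__)
#define SKETCH_LIKELY(EXPR) __builtin_expect(static_cast<bool>(EXPR), true)
#else
#define SKETCH_LIKELY(EXPR) (EXPR)
#endif

// Simplified stand-in for measureChar: returns the length of the character
// at S[I] and advances I, or returns 0 (leaving I unchanged) if invalid.
// Only ASCII is accepted in this sketch.
size_t measureAscii(const std::string &S, size_t &I) {
  if (static_cast<unsigned char>(S[I]) >= 0x80)
    return 0;
  ++I;
  return 1;
}

// Validation loop with the common (valid) path marked as likely.
bool isAllAscii(const std::string &S, size_t *ErrOffset = nullptr) {
  for (size_t I = 0; I < S.size();)
    if (!SKETCH_LIKELY(measureAscii(S, I))) {
      if (ErrOffset)
        *ErrOffset = I; // Offset of the first invalid byte.
      return false;
    }
  return true;
}
```

The hint does not change behavior; it only tells the compiler which branch to lay out as the fall-through path.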



Repository:
  rL LLVM

https://reviews.llvm.org/D46274




