[PATCH] D46274: [Support] Harden JSON against invalid UTF-8.
Ben Hamilton via Phabricator via llvm-commits
llvm-commits at lists.llvm.org
Mon Apr 30 10:50:53 PDT 2018
benhamilton requested changes to this revision.
benhamilton added a comment.
This revision now requires changes to proceed.
Looks good, but this misses a few edge cases for 4-byte sequences.
We also need to reject the CESU-8 encoding, where UTF-16 surrogate pairs are represented as two 3-byte UTF-8 sequences.
================
Comment at: lib/Support/JSON.cpp:537
+ case 2: // 110xxxxx 10xxxxxx.
+ // U+80 = C2 80 is the first two-byte character.
+ if (C < 0xC2 || !EatTrailing())
----------------
Can we clarify in the comment that this check also enforces the shortest possible encoding? Lead bytes `C0` and `C1` would encode code points below U+80, which already fit in a single byte.
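To illustrate the shortest-form point, here is a minimal sketch of a two-byte check. `measureTwoByte` is a hypothetical standalone helper written for this comment, not the patch's `measureChar`, and it assumes the input is a `std::string` of raw bytes.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper (not from the patch): returns 2 if S[I..] starts with
// a well-formed two-byte UTF-8 sequence, otherwise 0.
inline std::size_t measureTwoByte(const std::string &S, std::size_t I) {
  if (I + 1 >= S.size())
    return 0; // Truncated: no room for the trailing byte.
  unsigned char C = static_cast<unsigned char>(S[I]);
  unsigned char T = static_cast<unsigned char>(S[I + 1]);
  // Lead byte must be 110xxxxx. C0 and C1 would encode U+0000..U+007F, which
  // already fit in one byte, so accepting them would allow overlong forms.
  if (C < 0xC2 || C > 0xDF)
    return 0;
  // Trailing byte must be 10xxxxxx.
  if ((T & 0xC0) != 0x80)
    return 0;
  return 2;
}
```

With this, `C2 80` (U+80) is accepted while `C0 80` and `C1 BF` are rejected as overlong.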
================
Comment at: lib/Support/JSON.cpp:541
+ return 2;
+ case 3: // 1110xxxx 10xxxxxx 10xxxxxx.
+ // U+800 = E0 A0 80 is the first three-byte character.
----------------
We also need to check for and reject the so-called CESU-8 encoding, where UTF-16 surrogate pairs are "encoded" as separate 3-byte UTF-8 sequences (that is, sequences that decode to U+D800..U+DFFF):
https://www.unicode.org/reports/tr26/#definitions
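A rough sketch of what the surrogate rejection could look like for the three-byte case. `measureThreeByte` is a hypothetical standalone helper, not the patch's code. It relies on the fact that U+D800..U+DFFF would be encoded as `ED A0 80` through `ED BF BF`, so a lead byte of `ED` followed by a second byte above `9F` can be rejected.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper (not from the patch): returns 3 if S[I..] starts with
// a well-formed three-byte UTF-8 sequence, otherwise 0.
inline std::size_t measureThreeByte(const std::string &S, std::size_t I) {
  if (I + 2 >= S.size())
    return 0; // Truncated: need two trailing bytes.
  unsigned char C = static_cast<unsigned char>(S[I]);
  unsigned char T1 = static_cast<unsigned char>(S[I + 1]);
  unsigned char T2 = static_cast<unsigned char>(S[I + 2]);
  // Lead byte must be 1110xxxx.
  if (C < 0xE0 || C > 0xEF)
    return 0;
  // Both trailing bytes must be 10xxxxxx.
  if ((T1 & 0xC0) != 0x80 || (T2 & 0xC0) != 0x80)
    return 0;
  // E0 80..9F would encode a code point below U+800 (overlong).
  if (C == 0xE0 && T1 < 0xA0)
    return 0;
  // ED A0..BF would encode a UTF-16 surrogate, U+D800..U+DFFF (CESU-8).
  if (C == 0xED && T1 > 0x9F)
    return 0;
  return 3;
}
```

For example, the CESU-8 form of U+10000 is `ED A0 80 ED B0 80`; both halves are rejected here, while `ED 9F BF` (U+D7FF) and `EE 80 80` (U+E000) are still accepted.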
================
Comment at: lib/Support/JSON.cpp:552
+ // U+10000 = F0 90 80 80 is the first three-byte character.
+ if (C == 0xF0) {
+ if (EatTrailing() < 0x90 || !EatTrailing() || !EatTrailing())
----------------
We also need to handle two more cases that would encode a code point > `U+10FFFF`, which is not allowed:
1) First byte `== 0xF4` and second byte `> 0x8F`
2) First byte `> 0xF4`
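Both cases, plus the existing overlong check, can be sketched as below. `measureFourByte` is a hypothetical standalone helper for illustration, not the patch's code.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper (not from the patch): returns 4 if S[I..] starts with
// a well-formed four-byte UTF-8 sequence, otherwise 0.
inline std::size_t measureFourByte(const std::string &S, std::size_t I) {
  if (I + 3 >= S.size())
    return 0; // Truncated: need three trailing bytes.
  unsigned char C = static_cast<unsigned char>(S[I]);
  unsigned char T1 = static_cast<unsigned char>(S[I + 1]);
  unsigned char T2 = static_cast<unsigned char>(S[I + 2]);
  unsigned char T3 = static_cast<unsigned char>(S[I + 3]);
  // Lead byte must be F0..F4; anything above F4 would exceed U+10FFFF.
  if (C < 0xF0 || C > 0xF4)
    return 0;
  // All trailing bytes must be 10xxxxxx.
  if ((T1 & 0xC0) != 0x80 || (T2 & 0xC0) != 0x80 || (T3 & 0xC0) != 0x80)
    return 0;
  // F0 80..8F would encode a code point below U+10000 (overlong).
  if (C == 0xF0 && T1 < 0x90)
    return 0;
  // F4 90..BF would encode a code point above U+10FFFF.
  if (C == 0xF4 && T1 > 0x8F)
    return 0;
  return 4;
}
```

Here `F4 8F BF BF` (U+10FFFF) is the last accepted sequence; `F4 90 80 80` and anything starting with `F5` or higher are rejected.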
================
Comment at: lib/Support/JSON.cpp:572
+ for (size_t I = 0; I < S.size();)
+ if (!measureChar(S, I)) {
+ if (ErrOffset)
----------------
```
if (!LLVM_LIKELY(measureChar(S, I))) {
```
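For context, a self-contained sketch of the branch-hint pattern suggested above. `SKETCH_LIKELY` is a hypothetical stand-in for `LLVM_LIKELY` (which wraps `__builtin_expect` where the compiler supports it), and `measureAscii` is a deliberately simplified stand-in for `measureChar` that accepts ASCII only; neither is the patch's code.

```cpp
#include <cstddef>
#include <string>

// Hypothetical stand-in for LLVM_LIKELY, so the sketch builds without LLVM.
#if defined(__GNUC__) || defined(__clang__)
#define SKETCH_LIKELY(EXPR) __builtin_expect((bool)(EXPR), true)
#else
#define SKETCH_LIKELY(EXPR) (EXPR)
#endif

// Simplified stand-in for measureChar(): length of the character at S[I],
// or 0 if it is not ASCII.
inline std::size_t measureAscii(const std::string &S, std::size_t I) {
  return static_cast<unsigned char>(S[I]) < 0x80 ? 1 : 0;
}

// Scans S and reports the offset of the first rejected byte, if any.
inline bool isAllAscii(const std::string &S, std::size_t *ErrOffset = nullptr) {
  for (std::size_t I = 0; I < S.size();) {
    std::size_t Len = measureAscii(S, I);
    // Valid input is the expected case, so hint that the error branch is cold.
    if (!SKETCH_LIKELY(Len)) {
      if (ErrOffset)
        *ErrOffset = I;
      return false;
    }
    I += Len;
  }
  return true;
}
```

The hint only affects code layout and branch prediction; the result of the check is unchanged.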
Repository:
rL LLVM
https://reviews.llvm.org/D46274