ruiu added a comment.

I wonder if you also want to replace surrogate code points with U+FFFD, since a well-formed UTF-8 string must not contain any code point between U+D800 and U+DFFF.

Repository: rL LLVM

https://reviews.llvm.org/D46274
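For illustration, the suggested substitution could look roughly like the sketch below. It is not taken from the patch under review: `isSurrogate` and `appendUTF8` are hypothetical helper names, and the sketch assumes the code point has already been decoded into a `char32_t` before it is re-encoded.

```cpp
#include <string>

// Hypothetical helper (not from the patch): true if C lies in the
// UTF-16 surrogate range U+D800..U+DFFF.
static bool isSurrogate(char32_t C) { return C >= 0xD800 && C <= 0xDFFF; }

// Hypothetical helper (not from the patch): appends the UTF-8 encoding of C
// to Out. Surrogates and values above U+10FFFF are replaced with U+FFFD.
static void appendUTF8(char32_t C, std::string &Out) {
  if (isSurrogate(C) || C > 0x10FFFF)
    C = 0xFFFD; // REPLACEMENT CHARACTER

  if (C < 0x80) {
    // 1 byte: 0xxxxxxx
    Out.push_back(static_cast<char>(C));
  } else if (C < 0x800) {
    // 2 bytes: 110xxxxx 10xxxxxx
    Out.push_back(static_cast<char>(0xC0 | (C >> 6)));
    Out.push_back(static_cast<char>(0x80 | (C & 0x3F)));
  } else if (C < 0x10000) {
    // 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
    Out.push_back(static_cast<char>(0xE0 | (C >> 12)));
    Out.push_back(static_cast<char>(0x80 | ((C >> 6) & 0x3F)));
    Out.push_back(static_cast<char>(0x80 | (C & 0x3F)));
  } else {
    // 4 bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
    Out.push_back(static_cast<char>(0xF0 | (C >> 18)));
    Out.push_back(static_cast<char>(0x80 | ((C >> 12) & 0x3F)));
    Out.push_back(static_cast<char>(0x80 | ((C >> 6) & 0x3F)));
    Out.push_back(static_cast<char>(0x80 | (C & 0x3F)));
  }
}
```

With this sketch, `appendUTF8(0xD800, S)` would append the three bytes `EF BF BD` (the encoding of U+FFFD) rather than an ill-formed surrogate encoding, while code points outside the surrogate range are encoded unchanged.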