[PATCH] Make sure BitcodeWriter works with Unicode characters
Duncan P. N. Exon Smith
dexonsmith at apple.com
Sun Nov 9 19:56:59 PST 2014
> On 2014 Nov 9, at 13:17, Keno Fischer <kfischer at college.harvard.edu> wrote:
>
> Previously when a metadata string contained unicode characters,
> it would be incorrectly placed in the Record array because chars
> are signed by default and hence characters with the high bit set
> would get sign extended, but the bitcode writer was attempting
> to write the lowest 8 bit of the now sign-extended value. This
> caused an assertion failure later on. The fix is just to cast
> the pointer to uint8_t* first to prevent sign extension.
> This came up in the context for metadata strings, but I did a
> quick pass and changed the other instances of this pattern in
> the file as well.
>
> http://reviews.llvm.org/D6184
>
> Files:
> lib/Bitcode/Writer/BitcodeWriter.cpp
> test/Bitcode/unicode.ll
Two comments:
1. Should unicode characters ever make it here? I guess I'm not sure,
but I'd assumed the frontend should encode these directly. I.e.,
I didn't think:
metadata !"0x11\0012\00clang version ☃\001\00\000\00\000"
was a valid thing for a frontend to produce. It should already be
encoding:
metadata !"0x11\0012\00clang version \E2\98\83\001\00\000\00\000"
I don't really know though.
2. Assuming I'm wrong about the previous point, the C-style casts seem
a little brittle. I'd prefer adding new API to StringRef called
`ubegin()` and `uend()` (or something) that take care of the cast.
More information about the llvm-commits
mailing list