[PATCH] Make sure BitcodeWriter works with Unicode characters

Keno Fischer kfischer at college.harvard.edu
Sun Nov 9 20:04:20 PST 2014


On Sun, Nov 9, 2014 at 10:56 PM, Duncan P. N. Exon Smith
<dexonsmith at apple.com> wrote:
>
>> On 2014 Nov 9, at 13:17, Keno Fischer <kfischer at college.harvard.edu> wrote:
>>
>> Previously when a metadata string contained unicode characters,
>> it would be incorrectly placed in the Record array because chars
>> are signed by default and hence characters with the high bit set
>> would get sign extended, but the bitcode writer was attempting
>> to write the lowest 8 bit of the now sign-extended value. This
>> caused an assertion failure later on. The fix is just to cast
>> the pointer to uint8_t* first to prevent sign extension.
>> This came up in the context for metadata strings, but I did a
>> quick pass and changed the other instances of this pattern in
>> the file as well.
>>
>> http://reviews.llvm.org/D6184
>>
>> Files:
>>  lib/Bitcode/Writer/BitcodeWriter.cpp
>>  test/Bitcode/unicode.ll
>
> Two comments:
>
>  1. Should unicode characters ever make it here?  I guess I'm not sure,
>     but I'd assumed the frontend should encode these directly.  I.e.,
>     I didn't think:
>
>         metadata !"0x11\0012\00clang version ☃\001\00\000\00\000"
>
>     was a valid thing for a frontend to produce.  It should already be
>     encoding:
>
>         metadata !"0x11\0012\00clang version \E2\98\83\001\00\000\00\000"
>
>     I don't really know though.

Ok, it doesn't matter for the bug though, the test case still crashes
with the second variant. If it's valid to have the unicode character
in there, I'd like to leave it, otherwise I'll adjust the test case.

>
>  2. Assuming I'm wrong about the previous point, the C-style casts seem
>     a little brittle.  I'd prefer adding new API to StringRef called
>     `ubegin()` and `uend()` (or something) that take care of the cast.

Fair enough, I can make a separate patch that does this.




More information about the llvm-commits mailing list