[PATCH] Make sure BitcodeWriter works with Unicode characters
Duncan P. N. Exon Smith
dexonsmith at apple.com
Sun Nov 9 20:06:35 PST 2014
> On 2014 Nov 9, at 20:04, Keno Fischer <kfischer at college.harvard.edu> wrote:
>
> On Sun, Nov 9, 2014 at 10:56 PM, Duncan P. N. Exon Smith
> <dexonsmith at apple.com> wrote:
>>
>>> On 2014 Nov 9, at 13:17, Keno Fischer <kfischer at college.harvard.edu> wrote:
>>>
>>> Previously when a metadata string contained unicode characters,
>>> it would be incorrectly placed in the Record array because chars
>>> are signed by default and hence characters with the high bit set
>>> would get sign extended, but the bitcode writer was attempting
>>> to write the lowest 8 bit of the now sign-extended value. This
>>> caused an assertion failure later on. The fix is just to cast
>>> the pointer to uint8_t* first to prevent sign extension.
>>> This came up in the context for metadata strings, but I did a
>>> quick pass and changed the other instances of this pattern in
>>> the file as well.
>>>
>>> http://reviews.llvm.org/D6184
>>>
>>> Files:
>>> lib/Bitcode/Writer/BitcodeWriter.cpp
>>> test/Bitcode/unicode.ll
>>
>> Two comments:
>>
>> 1. Should unicode characters ever make it here? I guess I'm not sure,
>> but I'd assumed the frontend should encode these directly. I.e.,
>> I didn't think:
>>
>> metadata !"0x11\0012\00clang version ☃\001\00\000\00\000"
>>
>> was a valid thing for a frontend to produce. It should already be
>> encoding:
>>
>> metadata !"0x11\0012\00clang version \E2\98\83\001\00\000\00\000"
>>
>> I don't really know though.
>
> Ok, it doesn't matter for the bug though, the test case still crashes
> with the second variant. If it's valid to have the unicode character
> in there, I'd like to leave it, otherwise I'll adjust the test case.
Oh, I see. Of course it still crashes. Yeah, you should probably adjust
the testcase.
>> 2. Assuming I'm wrong about the previous point, the C-style casts seem
>> a little brittle. I'd prefer adding new API to StringRef called
>> `ubegin()` and `uend()` (or something) that take care of the cast.
>
> Fair enough, I can make a separate patch that does this.
More information about the llvm-commits
mailing list