[PATCH] Make sure BitcodeWriter works with Unicode characters

Duncan P. N. Exon Smith dexonsmith at apple.com
Sun Nov 9 20:06:35 PST 2014


> On 2014 Nov 9, at 20:04, Keno Fischer <kfischer at college.harvard.edu> wrote:
> 
> On Sun, Nov 9, 2014 at 10:56 PM, Duncan P. N. Exon Smith
> <dexonsmith at apple.com> wrote:
>> 
>>> On 2014 Nov 9, at 13:17, Keno Fischer <kfischer at college.harvard.edu> wrote:
>>> 
>>> Previously when a metadata string contained unicode characters,
>>> it would be incorrectly placed in the Record array because chars
>>> are signed by default and hence characters with the high bit set
>>> would get sign extended, but the bitcode writer was attempting
>>> to write the lowest 8 bit of the now sign-extended value. This
>>> caused an assertion failure later on. The fix is just to cast
>>> the pointer to uint8_t* first to prevent sign extension.
>>> This came up in the context for metadata strings, but I did a
>>> quick pass and changed the other instances of this pattern in
>>> the file as well.
>>> 
>>> http://reviews.llvm.org/D6184
>>> 
>>> Files:
>>> lib/Bitcode/Writer/BitcodeWriter.cpp
>>> test/Bitcode/unicode.ll
>> 
>> Two comments:
>> 
>> 1. Should unicode characters ever make it here?  I guess I'm not sure,
>>    but I'd assumed the frontend should encode these directly.  I.e.,
>>    I didn't think:
>> 
>>        metadata !"0x11\0012\00clang version ☃\001\00\000\00\000"
>> 
>>    was a valid thing for a frontend to produce.  It should already be
>>    encoding:
>> 
>>        metadata !"0x11\0012\00clang version \E2\98\83\001\00\000\00\000"
>> 
>>    I don't really know though.
> 
> Ok, it doesn't matter for the bug though, the test case still crashes
> with the second variant. If it's valid to have the unicode character
> in there, I'd like to leave it, otherwise I'll adjust the test case.

Oh, I see.  Of course it still crashes.  Yeah, you should probably adjust
the testcase.

>> 2. Assuming I'm wrong about the previous point, the C-style casts seem
>>    a little brittle.  I'd prefer adding new API to StringRef called
>>    `ubegin()` and `uend()` (or something) that take care of the cast.
> 
> Fair enough, I can make a separate patch that does this.





More information about the llvm-commits mailing list