[PATCH] Make sure BitcodeWriter works with Unicode characters

Sun Nov 9 19:56:59 PST 2014

> On 2014 Nov 9, at 13:17, Keno Fischer <kfischer at college.harvard.edu> wrote:
> 
> Previously when a metadata string contained unicode characters,
> it would be incorrectly placed in the Record array because chars
> are signed by default and hence characters with the high bit set
> would get sign extended, but the bitcode writer was attempting
> to write the lowest 8 bits of the now sign-extended value. This
> caused an assertion failure later on. The fix is just to cast
> the pointer to uint8_t* first to prevent sign extension.
> This came up in the context of metadata strings, but I did a
> quick pass and changed the other instances of this pattern in
> the file as well.
> 
> http://reviews.llvm.org/D6184
> 
> Files:
>  lib/Bitcode/Writer/BitcodeWriter.cpp
>  test/Bitcode/unicode.ll
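
For reference, the sign-extension problem described above can be reproduced
outside of LLVM.  This is only a minimal sketch, not the BitcodeWriter code:
the function names are made up, and it uses `signed char` explicitly because
the signedness of plain `char` is implementation-defined.

```cpp
#include <cassert>
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical helper: widen each byte through a signed char, which is what
// happens with plain `char` on targets where char is signed.
std::vector<uint64_t> recordSignExtended(const std::string &S) {
  std::vector<uint64_t> Record;
  for (const char *P = S.data(), *E = P + S.size(); P != E; ++P)
    Record.push_back(static_cast<signed char>(*P)); // 0xE2 -> 0xFFFF...FFE2
  return Record;
}

// Hypothetical helper: view the same bytes as uint8_t first, so that the
// widening to uint64_t zero-extends instead.
std::vector<uint64_t> recordZeroExtended(const std::string &S) {
  std::vector<uint64_t> Record;
  const uint8_t *P = reinterpret_cast<const uint8_t *>(S.data());
  for (const uint8_t *E = P + S.size(); P != E; ++P)
    Record.push_back(*P); // 0xE2 stays 0xE2
  return Record;
}
```

With the UTF-8 encoding of U+2603 ("\xE2\x98\x83") as input, the first
helper yields values that no longer fit in 8 bits, while the second yields
the three original byte values.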

Two comments:

 1. Should Unicode characters ever make it here?  I guess I'm not sure,
    but I'd assumed the frontend should encode these directly.  I.e.,
    I didn't think:

        metadata !"0x11\0012\00clang version ☃\001\00\000\00\000"

    was a valid thing for a frontend to produce.  It should already be
    encoding:

        metadata !"0x11\0012\00clang version \E2\98\83\001\00\000\00\000"

    I don't really know though.

 2. Assuming I'm wrong about the previous point, the C-style casts seem
    a little brittle.  I'd prefer adding new accessors to StringRef,
    called `ubegin()` and `uend()` (or something similar), that take
    care of the cast.
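
To illustrate the second point, here is a rough sketch of what I have in
mind.  `MiniStringRef` is a made-up stand-in so that the example compiles on
its own; it is not the real StringRef, and the accessor names are only the
suggestion from above, not existing API.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Stand-in for StringRef, reduced to what this sketch needs.
class MiniStringRef {
  const char *Data;
  size_t Length;

public:
  MiniStringRef(const char *Str, size_t Len) : Data(Str), Length(Len) {}

  const char *begin() const { return Data; }
  const char *end() const { return Data + Length; }

  // Proposed (hypothetical) accessors: the cast lives in exactly one place.
  const unsigned char *ubegin() const {
    return reinterpret_cast<const unsigned char *>(begin());
  }
  const unsigned char *uend() const {
    return reinterpret_cast<const unsigned char *>(end());
  }
};

// A caller can then fill a record without any casts at the call site; each
// unsigned char is zero-extended when converted to uint64_t.
std::vector<uint64_t> toRecord(MiniStringRef S) {
  return std::vector<uint64_t>(S.ubegin(), S.uend());
}
```

Callers that currently iterate over `begin()`/`end()` and cast each pointer
would then just switch to the unsigned pair.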