[llvm-dev] Experiments with a replacement for the bitcode format
Xinliang David Li via llvm-dev
llvm-dev at lists.llvm.org
Wed Sep 28 14:03:07 PDT 2016
On Tue, Sep 27, 2016 at 12:56 PM, Mehdi Amini via llvm-dev <
llvm-dev at lists.llvm.org> wrote:
> Disclaimer: I have no plan to break 3.x bitcode compatibility in any way.
> I’m experimenting with ideas to improve the current bitcode serialization
> and deserialization in general.
> I’m interested in improving the speed and the flexibility of the
> reader/writer, as well as making the code easier to deal with.
> My current grievances with our bitcode format are mainly the lack of
> built-in random access (which prevents efficient lazy loading), and an
> implementation that does not separate the high-level structure of the
> serialization from how it is emitted (including the compression part). On
> the speed side, the bit-level management of records also has non-negligible
> overhead right now.
> The two key issues with the current format that prevent random access are:
What are the most common IR entities that need fast random access?
> - Abbrevs are lazily emitted in the stream and can be interleaved with
> records, so you have to scan the whole block to get the context for a record.
> - Because records have variable size, you need to parse them in sequence
> to find the “interesting” one.
> I just finished a quick prototype using Google FlatBuffers (
> http://google.github.io/flatbuffers/ ). This is a solution that provides
> “zero-copy” serialization/deserialization by avoiding variable-length
> integers and aligning every field appropriately. The structure of the
> serialization is described using a DSL, and the
> serialization/deserialization code is generated from this description.
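For readers unfamiliar with the DSL: a heavily simplified, hypothetical schema for an IR module might look like the following. The type and field names here are invented for illustration, not taken from the actual prototype.

```
// Hypothetical, simplified FlatBuffers schema -- names are illustrative only.
table Metadata { operands: [uint]; }          // indices into Module.metadata
table Instruction { opcode: uint; operands: [uint]; }
table Function { name: string; instructions: [Instruction]; }
table Module { functions: [Function]; metadata: [Metadata]; }
root_type Module;
```

`flatc` generates reader/writer code from such a schema, and vector fields are addressable by index without decoding their neighbors, which is where the random access comes from.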
> My prototype is a naive description of the IR, but complete enough to
> handle all the .ll files present in the LLVM validation, and round-trip an
> 800 MB file from the LTO link of Clang itself.
> The result is that the extra flexibility given by random access helps
> beyond lazy loading: for instance, loading metadata is usually an issue
> because of forward references, while random access allows “recursing” into
> the operands when loading a metadata node. This simplifies the algorithms
> on both the reader and the writer.
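The recursion idea can be sketched as follows. The on-disk model here is invented for illustration: each metadata node is a list of operand indices into a flat node table, so operands can be materialized on demand (with memoization) instead of creating forward-reference placeholders and patching them later. Real LLVM metadata can contain cycles among distinct nodes, which would need temporary-node handling; that is omitted here.

```python
# Each serialized node is a list of operand indices into a flat table.
serialized = {
    0: [1, 2],   # node 0 references nodes 1 and 2 ("forward" refs in stream order)
    1: [2],
    2: [],
}

def load_node(table, index, cache):
    """Materialize node `index`, recursing into its operands first (memoized)."""
    if index in cache:
        return cache[index]
    node = tuple(load_node(table, op, cache) for op in table[index])
    cache[index] = node
    return node

cache = {}
root = load_node(serialized, 0, cache)
```

With a purely sequential format, node 0 would have to be created before nodes 1 and 2 exist, forcing the placeholder-and-patch dance the email describes.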
> On the drawback side, FlatBuffers seems to target small use cases (not a
> lot of different types of records, nor large buffers). The implementation
> is not fully optimized; for example, 40% of the writer time is spent on
> deduplication of “vtables” (the equivalent of our abbrevs) on some of my
> test cases.
> Also, multiple obvious size optimizations are not possible with the
> current FlatBuffers: my prototype currently emits files that are ~2 times
> larger than our bitcode format. By working on the schema I can probably
> get this down to 50%, but that may still be too high for us (especially
> considering that some of these limits are inherent to the FlatBuffers
> implementation itself).
Do you have read/write time comparison data (for sequential and random
access)?
> So my current thought on the topic is that we could still learn from
> other recently developed formats and bring some of their ideas into our
> format where it makes sense. For example, we don’t need random access in
> the instruction array, but it would be valuable for the metadata. We’d
> want zero-copy for a symbol table (see
> https://llvm.org/bugs/show_bug.cgi?id=27551 ) but not for the instruction
> stream.
> I’m also interested in other people’s stakes in the bitcode format
> (other than backward compatibility): what tradeoffs are you ready to make
> on size vs. speed/flexibility, for example?
A hybrid format (using fixed length representation or indirection for parts
that need fast random access, the rest uses compression) seems promising.
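One minimal sketch of such a hybrid layout, with an invented file structure (this is not a proposal for the actual format): a fixed-width offset table up front gives O(1) access to individual entries, while each entry's payload is compressed independently so only the entries you touch are decompressed.

```python
import struct
import zlib

def write_hybrid(entries):
    """Layout: [count:u32][(offset:u32, size:u32) * count][compressed blobs]."""
    blobs = [zlib.compress(e) for e in entries]
    offsets, pos = [], 0
    for b in blobs:
        offsets.append(pos)
        pos += len(b)
    header = struct.pack("<I", len(blobs))
    table = b"".join(struct.pack("<II", off, len(b))
                     for off, b in zip(offsets, blobs))
    return header + table + b"".join(blobs)

def read_entry(buf, i):
    """Random access: jump straight to entry i via the fixed-width table."""
    (count,) = struct.unpack_from("<I", buf, 0)
    off, size = struct.unpack_from("<II", buf, 4 + 8 * i)
    base = 4 + 8 * count
    return zlib.decompress(buf[base + off : base + off + size])

buf = write_hybrid([b"function body 0", b"metadata node", b"symbol table"])
```

Reading entry 1 here never touches the compressed bytes of entries 0 or 2, which is the lazy-loading property the thread is after; per-entry compression trades some ratio for that independence.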
Why can't backward compatibility be kept?
> LLVM Developers mailing list
> llvm-dev at lists.llvm.org