<html><head><meta http-equiv="Content-Type" content="text/html charset=utf-8"></head><body style="word-wrap: break-word; -webkit-nbsp-mode: space; -webkit-line-break: after-white-space;" class=""><br class=""><div><blockquote type="cite" class=""><div class="">On Sep 28, 2016, at 2:03 PM, Xinliang David Li <<a href="mailto:xinliangli@gmail.com" class="">xinliangli@gmail.com</a>> wrote:</div><br class="Apple-interchange-newline"><div class=""><div dir="ltr" class=""><br class=""><div class="gmail_extra"><br class=""><div class="gmail_quote">On Tue, Sep 27, 2016 at 12:56 PM, Mehdi Amini via llvm-dev <span dir="ltr" class=""><<a href="mailto:llvm-dev@lists.llvm.org" target="_blank" class="">llvm-dev@lists.llvm.org</a>></span> wrote:<br class=""><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">Hello,<br class="">
<br class="">
Disclaimer: I have no plan to break 3.x bitcode compatibility in any way.<br class="">
<br class="">
I’m experimenting with ideas to improve the current bitcode serialization and deserialization in general.<br class="">
I’m interested in improving the speed and the flexibility of the reader/writer, as well as making the code easier to deal with.<br class="">
<br class="">
My current griefs against our bitcode format are mainly the lack of built-in random access (which prevents efficient lazy loading), and the implementation which does not help layering the high-level structure of the serialization from what is emitted (and the compression part). On the speed side, the bit level management of record has also non-negligible overhead right now.<br class="">
The two key issues with the current format that prevents random access are:<br class="">
<br class=""></blockquote><div class=""><br class=""></div><div class="">What are most common IR entities that need fast random accesses? </div></div></div></div></div></blockquote><div><br class=""></div><div>It really depends on the use case: if you want to load a full module at once, we can get away with all but Metadatas.</div><div>The only things that have cycles are named types and instructions, but these have fast RAUW support anyway and the writer can lay them in order so that most of the time operands are parsed before the user.</div><div><br class=""></div><div>For lazy-load or selective loading, random access helps a lot for everything but instructions (if you materialize a function then you read all of the instructions anyway).</div><div><br class=""></div><blockquote type="cite" class=""><div class=""><div dir="ltr" class=""><div class="gmail_extra"><div class="gmail_quote"><div class=""> </div><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">
- Abbrev are lazily emitted in the stream and can be interleaved with record. You have to scan the whole block to get the context for a record.<br class="">
- Because records have variable size, you need to parse then in sequence to find the “interesting” one.<br class="">
<br class="">
I just finished a quick prototype using Google Flatbuffers ( <a href="http://google.github.io/flatbuffers/" rel="noreferrer" target="_blank" class="">http://google.github.io/<wbr class="">flatbuffers/</a> ). This is a solution that provide “zero-copy” serialization/deserialization, by avoiding the use of variable length integer and aligning appropriately every field. The structure of the serialization is described using a DSL, and the serialization/deserialization code is generated from this description.<br class="">
My prototype is a naive description of the IR, but complete enough to handle all the .ll files present in the LLVM validation, and rountrip a 800MB file from the LTO link of Clang itself.<br class="">
<br class="">
The results is that the extra flexibility given by the random access is helping beyond the lazy loading: for instance loading metadata is an issue because of forward reference usually, while the random access allows to “recurse” on the operand when loading a metadata. This simplifies the algorithms both on the reader and the writer.<br class="">
<br class="">
On the drawback, Flatbuffers seems to target small use cases (not a lot of different type of records or large buffers). The implementation is not totally optimized, for example 40% of the writer time is spend on “vtables” (equivalent of our abbrev) deduplication on some of my test cases.<br class="">
Also multiple possible obvious size optimizations are not possible with the current Flatbuffers, currently my prototype emit files that are ~2 times larger that our bitcode format. By working on the schema I probably can get this to down 50%, but that may still be too high for us (especially considering that some of these limits are inherent to Flatbuffer implementation itself).<br class=""></blockquote><div class=""><br class=""></div><div class="">Do you have read/write time comparison data (for sequential and random access)?</div></div></div></div></div></blockquote><div><br class=""></div><div>Keeping in mind that I have a naive implementation (both size and speed): the writer is 10% slower (but spend 40% of its time in the not-so-smart-implementation of flatbuffers vtable deduplication). </div><div>The reader can be ~40% faster for loading a full module (LTO merged module of clang with debug info), but even a lot faster when you want to load only one function from the module.</div><div>The current lazy-loading scheme still forces us to create a declaration for every global/alias/function and load all the global constants expressions. (Duncan did a bunch of work this year to help ThinLTO on this issue by storing more record nested inside the function blocks and not at the global level).</div><div><br class=""></div><div><br class=""></div><blockquote type="cite" class=""><div class=""><div dir="ltr" class=""><div class="gmail_extra"><div class="gmail_quote"><div class=""> </div><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">
<br class="">
So my current thoughts on the topic is that we could still learn from other recently developed format and bring part of the ideas into our format where it makes sense. For example we don’t need random access in the instruction array, but it’d be appreciable for the metadata though. We'd want zero-copy for a symbol table (see <a href="https://llvm.org/bugs/show_bug.cgi?id=27551" rel="noreferrer" target="_blank" class="">https://llvm.org/bugs/show_<wbr class="">bug.cgi?id=27551</a> ) but not for the instruction stream.<br class="">
<br class="">
I’m also interested in what are other people’s stakes on the bitcode format (other than backward compatibility), what tradeoff are you ready to make on the size vs speed/flexibility for example?<br class="">
<br class=""></blockquote><div class=""><br class=""></div><div class="">A hybrid format (using fixed length representation or indirection for parts that need fast random access, the rest uses compression) seems promising. Why can't backward compatibility be kept?</div></div></div></div></div></blockquote><div><br class=""></div><div>I don’t plan to lose backward compatibility in any way. Was you question intended to be "why can’t backward compatibility be dropped?" or did you misread my disclaimer at the top of the email?</div><div><br class=""></div><div><br class=""></div><div>— </div><div>Mehdi</div><div><br class=""></div></div></body></html>