[LLVMdev] Bitcode format

Thu Sep 6 23:55:25 PDT 2007

On Tue, 4 Sep 2007, Joshua Haberman wrote:
>>>   the top of my head I can think of a few, like a tool to suggest
>>>   abbreviations that would give a file better compression.
>>
>> Wouldn't those abbreviations imply some application-specific knowledge
>> of the format?
>
> No -- unless I am mistaken about bitcode, abbreviations have no semantic
> meaning.  They are nothing but a space optimization, and needn't even be
> exposed to the application (at least for reading).

You're absolutely right, that is by design. :)
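
A minimal sketch of that point, assuming a heavily simplified record layout: the same logical record is written once unabbreviated (every field as VBR6) and once through an abbreviation, and both decode to the identical (code, operands) pair. The helper names and the example abbreviation are hypothetical, and the abbreviation IDs that precede each record in a real stream are left out for brevity.

```python
def write_fixed(bits, value, width):
    """Append `value` as `width` bits, least-significant bit first."""
    for i in range(width):
        bits.append((value >> i) & 1)

def write_vbr(bits, value, width):
    """Append `value` in chunks of `width` bits; the top bit of a chunk means 'more follows'."""
    threshold = 1 << (width - 1)
    while value >= threshold:
        write_fixed(bits, (value & (threshold - 1)) | threshold, width)
        value >>= width - 1
    write_fixed(bits, value, width)

class Reader:
    def __init__(self, bits):
        self.bits, self.pos = bits, 0

    def fixed(self, width):
        value = 0
        for i in range(width):
            value |= self.bits[self.pos] << i
            self.pos += 1
        return value

    def vbr(self, width):
        threshold = 1 << (width - 1)
        value, shift = 0, 0
        while True:
            chunk = self.fixed(width)
            value |= (chunk & (threshold - 1)) << shift
            if not chunk & threshold:
                return value
            shift += width - 1

def encode_unabbreviated(code, ops):
    bits = []
    write_vbr(bits, code, 6)       # record code
    write_vbr(bits, len(ops), 6)   # operand count
    for op in ops:
        write_vbr(bits, op, 6)
    return bits

def decode_unabbreviated(bits):
    r = Reader(bits)
    code = r.vbr(6)
    count = r.vbr(6)
    return code, [r.vbr(6) for _ in range(count)]

# Hypothetical abbreviation: the code is a literal (costs no bits),
# followed by two fixed 8-bit operands.
ABBREV = [("literal", 7), ("fixed", 8), ("fixed", 8)]

def encode_abbreviated(abbrev, code, ops):
    bits = []
    for (kind, arg), value in zip(abbrev, [code] + ops):
        if kind == "literal":
            assert value == arg    # nothing is written for a literal
        else:
            write_fixed(bits, value, arg)
    return bits

def decode_abbreviated(abbrev, bits):
    r = Reader(bits)
    fields = [arg if kind == "literal" else r.fixed(arg) for kind, arg in abbrev]
    return fields[0], fields[1:]

record = (7, [200, 13])
long_form = encode_unabbreviated(*record)
short_form = encode_abbreviated(ABBREV, *record)
```

Decoding either form hands the application the same record, so a reader never needs to know which encoding was used; only the bit count differs.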

>> I can't think of any situation where a generic tool could
>> do anything other than outline the contents of the file by looking at
>> the blocks, e.g. llvm-bcanalyzer.
>
> I can think of several tools that could be application-neutral and very
> useful:

Absolutely.  The best way to think about the bitcode format is as an 
application-independent container format... just like XML.  Unlike XML, it 
is designed to be a dense binary format, but it should be trivial to 
transform bitcode to XML and back (for example).  There are lots of tools 
that work on arbitrary XML files, so I don't see why there wouldn't be 
application-independent tools for bitcode.

> - a tool for creating arbitrary bitcode files by reading textual input
>  on stdin written in some language that is isomorphic to bitcode.  This
>  could be useful for things like constructing bitcode files that don't
>  follow the expectations of your application, so you can test that your
>  application handles such corruption gracefully.

Yep.  If this is your goal, I think it would be useful to add some new 
standardized records to blockinfo that let the file give textual 
names to blocks.  For example, a bitcode file could say that block id 
#1234 is the "foo" block.  If converting to/from XML, you'd then print it 
as <foo>...</foo>.
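
A rough sketch of the XML side of that idea, assuming the container has already been parsed into nested Python dicts and that the block names arrive as a plain id-to-name mapping. Both the in-memory layout and the `names` dict are hypothetical; the blockinfo name records are only a proposal at this point.

```python
import xml.etree.ElementTree as ET

def block_to_xml(block, names):
    """Render one parsed block as an XML element.

    `block` is {"id": int, "items": [...]}, where each item is either
    ("record", code, [operands]) or ("block", nested_block_dict).
    `names` maps block ids to textual names (hypothetical).
    """
    block_id = block["id"]
    if block_id in names:
        elem = ET.Element(names[block_id])
    else:
        # No name known: fall back to a generic tag carrying the numeric id.
        elem = ET.Element("block", id=str(block_id))
    for item in block["items"]:
        if item[0] == "record":
            _, code, operands = item
            rec = ET.SubElement(elem, "record", code=str(code))
            rec.text = " ".join(str(op) for op in operands)
        else:
            elem.append(block_to_xml(item[1], names))
    return elem

names = {1234: "foo"}
tree = {"id": 1234, "items": [("record", 7, [1, 2, 3])]}
xml_text = ET.tostring(block_to_xml(tree, names), encoding="unicode")
```

Here block id 1234 comes out as a `<foo>` element, while any block without a known name keeps its numeric id, so the conversion stays lossless either way.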

> I know there must already be bitcode files flying around, and so I can
> see the difficulty of making any changes to the format.  What if we just
> said that BC (2 bytes) is the magic number for the Bitcode format, and
> that 0xC0DE (2 bytes) is the application-specific magic number for LLVM?

Sure, this makes sense to me.
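
As an illustration of that split, here is a small sketch that treats the first two bytes as the container magic and the next two as the application magic. The function names are made up for this example and are not part of any LLVM API.

```python
CONTAINER_MAGIC = b"BC"                  # generic bitcode container
LLVM_APP_MAGIC = bytes([0xC0, 0xDE])     # application-specific magic for LLVM

def split_magic(header: bytes) -> bytes:
    """Check the container magic and return the 2-byte application magic."""
    if len(header) < 4:
        raise ValueError("need at least 4 bytes of header")
    if header[:2] != CONTAINER_MAGIC:
        raise ValueError("not a bitcode container")
    return header[2:4]

def is_llvm_bitcode(header: bytes) -> bool:
    """True if the header is a bitcode container holding LLVM IR."""
    return split_magic(header) == LLVM_APP_MAGIC
```

A generic tool would only need `split_magic`; an LLVM-specific reader would go on to check the application half.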

> In any case, when I submit my patch to that document, I'll put any
> of these assumptions that seem supremely reasonable in there and see
> what the feedback is.  I'm especially interested to hear from Chris,
> since he wrote that document (and I assume designed the format).

It sounds like you're right on all counts.  Thanks for helping 
improve the documentation; I'm glad you find the format interesting and 
useful.  In the future, I'll probably also extend it to encode tree 
structures more efficiently (important for ASTs in the new C frontend).

-Chris

-- 
http://nondot.org/sabre/
http://llvm.org/