[LLVMdev] Bitcode format

Mon Sep 3 22:26:34 PDT 2007

Reid Spencer <rspencer <at> reidspencer.com> writes: 
> Hi Joshua,

Hi, thanks for the reply.

> On Mon, 2007-09-03 at 21:34 +0000, Joshua Haberman wrote:
> > I also have a few questions about the format:
> > 
> > - it appears that the only magic number in the file is 
> >   application-specific.  This seems unfortunate, because it means that
> >   application-neutral tools cannot be built that process bitcode files,
> >   since they could not reliably detect that the file is a bitcode file.
> >   It might seem like there is little room for application-neutral tools
> >   since almost all the data in the file is application-specific, but off
> >   the top of my head I can think of a few, like a tool to suggest
> >   abbreviations that would give a file better compression.
> 
> Wouldn't those abbreviations imply some application-specific knowledge
> of the format?

No -- unless I am mistaken about bitcode, abbreviations have no semantic
meaning.  They are nothing but a space optimization, and needn't even be
exposed to the application (at least for reading).

> I can't think of any situation where a generic tool could
> do anything other than outline the contents of the file by looking at
> the blocks, e.g. llvm-bcanalyzer.

I can think of several tools that could be application-neutral and very
useful:

- a tool that prints the hierarchy of blocks and their sizes in the file

- a tool that does more sophisticated space usage analysis.  For
  example, how much space are the abbreviations saving?  Could the file
  be more efficiently packed with different VBR choices?  Or with
  different abbreviations?  It could even spit out a file that is 
  semantically identical but smaller.

- a tool that dumps selected blocks from the file (possibly giving you 
  fine-grained control over what blocks, how many records from each 
  block, etc).

- a tool for creating arbitrary bitcode files by reading textual input 
  on stdin written in some language that is isomorphic to bitcode.  This
  could be useful for things like constructing bitcode files that don't
  follow the expectations of your application, so you can test that your
  application handles such corruption gracefully.
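
As a rough illustration of the VBR question in the second item, here is a
minimal Python sketch that counts how many bits a value costs at a given
VBR chunk width and picks the cheapest width for a set of values.  The
function names (vbr_bits, best_width) are made up for this example and are
not part of LLVM.  It assumes the usual VBR layout: each chunk is `width`
bits, the top bit means "another chunk follows", and the rest carry data.

```python
def vbr_bits(value: int, width: int) -> int:
    """Bits needed to store a non-negative value as a VBR field of this chunk width."""
    if width < 2:
        raise ValueError("a VBR chunk needs at least 2 bits")
    if value < 0:
        raise ValueError("value must be non-negative")
    payload = width - 1   # data bits per chunk; the top bit flags continuation
    chunks = 1            # even the value 0 occupies one chunk
    value >>= payload
    while value:
        chunks += 1
        value >>= payload
    return chunks * width


def best_width(values, widths=range(2, 33)):
    """Return (width, total_bits) minimising the total encoded size of `values`."""
    values = list(values)
    return min(
        ((w, sum(vbr_bits(v, w) for v in values)) for w in widths),
        key=lambda pair: pair[1],
    )
```

A generic tool could run something like best_width over every VBR field
it sees in a block, without knowing what the fields mean.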

I know there must already be bitcode files flying around, and so I can 
see the difficulty of making any changes to the format.  What if we just
said that BC (2 bytes) is the magic number for the Bitcode format, and
that 0xC0DE (2 bytes) is the application-specific magic number for LLVM?
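
To make the proposal concrete, a generic tool could then check a file
roughly like this.  This is only a sketch of the split suggested above,
not an existing API; the names are hypothetical, and it assumes 0xC0DE is
stored as the byte 0xC0 followed by 0xDE.

```python
FORMAT_MAGIC = b"BC"          # proposed magic for the Bitcode container format
LLVM_APP_MAGIC = b"\xC0\xDE"  # proposed application-specific magic for LLVM (assumed byte order)


def classify(header: bytes) -> str:
    """Classify a file by its first four bytes under the proposed two-level magic."""
    if len(header) < 4 or header[:2] != FORMAT_MAGIC:
        return "not bitcode"
    if header[2:4] == LLVM_APP_MAGIC:
        return "LLVM bitcode"
    # A generic tool can still process the file, just without app knowledge.
    return "bitcode (unknown application)"
```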

> > - the LLVM code assumes that several VBR fields can be at most 32 bits 
> >   (block ids, number of elements in an array, etc).  These assumptions
> >   seem quite reasonable: can they be considered part of the format and
> >   added to the document?
> 
> I don't see why not. The llvm bitcode documentation is specific to llvm
> anyway so the limits should be defined. 

Hmm, the Bitcode documentation [0] seems to contain both application-neutral
and LLVM-specific parts.  Specifically, only section 4 of that document
seems specific to LLVM.

In any case, when I submit my patch to that document, I'll include
whichever of these assumptions seem clearly reasonable and see what the
feedback is.  I'm especially interested to hear from Chris, since he
wrote that document (and, I assume, designed the format).
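
For what it's worth, here is a sketch of what documenting such a limit
would let a reader do: decode a VBR field and reject anything wider than
32 bits.  The decode_vbr helper and the MAX_BITS constant are hypothetical,
and for simplicity the input is a sequence of already-extracted chunks
rather than a raw bit stream.

```python
MAX_BITS = 32  # assumed limit, following the discussion in this thread


def decode_vbr(chunks, width):
    """Decode one VBR value from an iterable of `width`-bit chunks (lowest bits first)."""
    payload = width - 1
    cont = 1 << payload          # top bit of a chunk: another chunk follows
    value = 0
    shift = 0
    for chunk in chunks:
        value |= (chunk & (cont - 1)) << shift
        shift += payload
        if value >> MAX_BITS:
            raise ValueError("VBR value exceeds 32 bits")
        if not chunk & cont:
            return value
    raise ValueError("truncated VBR field")
```

With width 4, the chunks 0b1011 and 0b0011 decode to 27, while a run of
chunks that pushes the value past 32 bits is rejected instead of silently
wrapping.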

> Thanks for your interest, Joshua.

Thanks for encouraging my participation.  And you can call me Josh.  :)

Josh

[0] http://llvm.org/releases/2.0/docs/BitCodeFormat.html