[LLVMdev] Re: Bytecodes & docs

Tue Aug 17 16:48:20 PDT 2004

Reid,

Thanks for the detailed feedback.

A value of zero now means zero literal for everything except labels, 
right?  There is kind of a vague reference to this in the 1.0 -> 1.1 
section I believe.  You might want to make this clearer when talking about 
values in the body of the document.
--> A comment on this: if a value of zero were never used for labels, that 
would make me happy, because then my code could replace references to zero 
with literal zero always and always subtract 1 from the value if not zero 
to index into my type/value table.

After reading through the upgrade sections, it seemed to me that there are 
several things mentioned there that ought to also be mentioned in the body 
of the bytecode document.  I admit I'm lazy, so I usually only read upgrade 
sections of a doc when I'm busy upgrading to the next version.  Here's a 
vote for making the docs more friendly to lazy skimmers like myself.

More comments below:

At 08:56 PM 8/16/2004, you wrote:
>From: Reid Spencer <reid at x10sys.com>
>Precedence: list
>Subject: Re: [LLVMdev] Bytecode file bugs / doc bugs
>Date: Mon, 16 Aug 2004 17:56:22 -0700
>To: LLVM Developers Mailing List <llvmdev at cs.uiuc.edu>
>References: <6.0.3.0.2.20040816131856.02beba60 at mail.mykland.com>
>In-Reply-To: <6.0.3.0.2.20040816131856.02beba60 at mail.mykland.com>
>Reply-To: LLVM Developers Mailing List <llvmdev at cs.uiuc.edu>
>Message-ID: <1092704181.19366.307.camel at bashful.x10sys.com>
>Content-Type: multipart/signed; micalg=pgp-sha1;
>         protocol="application/pgp-signature";
>         boundary="=-5N8aoZTwjHs3Bi3uXvgc"
>MIME-Version: 1.0
>Message: 4
>
>Robert,
>
>Here's my more detailed feedback, inline with your comments. The patch
>for the changes to the documentation I've made is here:
>
>http://mail.cs.uiuc.edu/pipermail/llvm-commits/Week-of-Mon-20040816/017230.html
>
>Reid.
>
>On Mon, 2004-08-16 at 13:49, Robert Mykland wrote:
> > Dear Reid and Chris,
> >
> > I thought I should send this to the list in case anyone else is struggling
> > to interpret bytecode files with the new docs.
> >
> > (1)     First a bug I already mentioned to Reid.  Unlike the other new
> > module headers module 0x01 still uses the old 32-bit and 32-bit format
> > instead of the new 5-bit and 27-bit format.  Thus the first module in the
> > file will be 0x00000001 followed by a 32-bit size.
>
>This is a doc bug. Its fixed per my previous email.

Reid explained to me in seperate email that you need the extra five bits in 
this particular case to support large executable files.  This makes a lot 
of sense.

I had an idea about this.  If the size parameters represented 32-bit words 
instead of bytes (since they're aligned anyway) this would allow you to use 
the same format even for the first one.

Another way to save space with headers would be to make the combined 
ID/size field a variable byte encoded field.

> >
> > (2)     Though it states in the docs that module size numbers do not
> > include padding bytes, they often do.  Here's an incomplete list of which
> > do and which don't:
> >
> > module 0x01 includes align bytes
> > GTP block size does not include align bytes
> > module global info does
> > function definitions sometimes do and sometimes don't
> > compaction tables do not
> > instruction list -- unknown
> > function symbol table does not
>
>Hrm. This is not good. I'm going to revisit the whole need for alignment
>in bytecode files. I need to check with Chris about why this is
>necessary in the first place. There's definitely something screwball
>here and it might take a code change (yet another bytecode format change
>in 1.4) to fix it.
>
>I'll provide more feedback on this point later.
>
> >
> > (3)     In Reid's documentation, his "opcode" link is bad.  His doc does
> > not yet contain an opcode section.  Presumably this would contain the info
> > from the include file Instruction.def.
>
>This is also a doc bug. Its fixed by just referencing the
>Instruction.def file on the cvsweb which will always contain the correct
>list of instruction opcode values for the latest release. Note that that
>might not be correct for *your* release :)

This is not a good fix for people like me who may be a few versions back 
from the latest release from time to time.  This info should really be 
duplicated in the body of the doc.

> >
> > (4)     Something is messed up about the symbol table definitions.  The
> > number of entries and the type referred to by the symbol are correct, then
> > there is a third byte that was either a zero or a one in my simple
> > test.  This third byte is not described in the docs I don't believe.  Then
> > there is the size byte and the string.
>
>The extra byte you're referring to is probably the typeid for the value
>plane. It is correct and as designed.

D'OH!  Yes, I realized this after I decoded for a bit more.  One does need 
that value field...

>Because types and values got disassociated in 1.3 (Type no longer
>derives from Value), they are handled in separate data structures
>internally and written to the symbol table block separately. The
>documentation is only slightly deficient in this area as it didn't
>correctly identify the type of lists used for the types and value
>planes.  The documentation has been updated to correctly reflect the
>nature of the lists.
>
>Robert: if you could, please review:
>
>http://llvm.x10sys.com/llvm/docs/BytecodeFormat.html#symtab
>
>and let me know if it makes sense now.

IMHO still not clear.  The problem here is that too many things in this 
documentation are referred to as "slot number".  I know that's how the 
comments read in the code, but it's unclear.  The term is used both to 
describe an index into a type slot and the slot itself.  We need some 
clearer nomenclature here.

My suggestion would be to drop the word "slot" entirely and go with 
something like "type index" and "value index" for these two kinds of things.

Also, painful I know, I would split "symbol table entry" up into two 
sections, one for types, one for actual symbols, just so you can make the 
description of that first field in each crystal clear.

My 2c.

> > (5)     Labels used to have their own type.  If this is still the case, 
> its
> > not discussed in Reid's document.  It looks like the new type slot for
> > label is 12, the same as raw function.  Presumably this would be the 
> secret
> > type slot between the last primitive type (11) and the new start of the
> > defined types table (13).
>
>This is probably a result of the "Type != Value" change that happened in
>1.3. In 1.2, we had (in Type.h):

Yes.  This was one of those items that was buried back in the upgrade 
section.  Lazy skimmers like myself will get confused and ask about this.

>DoubleTyID = 11
>TypeTyID = 12
>LabelTyID = 13
>FunctionTyID = 14
>
>We now have:
>DoubleTyID = 11
>LabelTyID = 12
>FunctionTyID = 13
>
>So, we eliminated TypeTyID (one of the main points of the change was to
>get rid of this) and shifted down the values higher than it was before.
>So, the new value for a label is 12 and, as you noted, the derived types
>start at 13 now instead of 14. This is by design. The bytecode reader
>accommodates this change and will read 1.2 bytecode files correct even
>though all the primitive type IDs have changed.
>
>Sorry if this causes you grief, but its important for the design of our
>internal data structures that the type ids be contiguous. Hopefully this
>is the last change in this area for a long time. :)

No problem at all.  This was a 5 second change in my code.

>This change is briefly documented here:
>
>http://llvm.x10sys.com/llvm/docs/BytecodeFormat.html#vers12
>
>in the "Type Derives From Value" section.
>
> > If I were sorting these into bugs vs. doc changes for 1.31, I would fix 
> #1,
> > #2, and possibly #4 in the code depending on what the byte is used for.
>
>I don't think any code changes are needed except possibly to clean up
>#2.
>
>  The documentation changes have been made already.
>
> > It might be a really good bug hunting expedition before each release to
> > decode a few bytecode files by hand.  In my experience this is the only 
> way
> > to get physical protocols like this right.
>
>Agreed, but its pretty labor intensive.

I'll sign up for some of this.  I have a vested interest and independently 
written code that reads bytecodes and depends on this physical 
protocol.  Let me know when you make a significant change and I can see 
whether my bytecode reader chokes on anything.  Unfortunately, so far, its 
coverage is limited.  Also, over time it will become immune to checking 
some problems (for example, it now always rounds block sizes to the next 
nearest 32-bit boundary to cover the size of padding in all cases).

FYI, I found all the stuff mentioned here by compiling the simplest 
possible C program:

int main( void )
{
         return( 0 );
}

Test cases like this don't take too much time to look through by hand.

>What we do have a start on is some regression tests that ensure that at
>least backwards compatibility is retained. Its been on my "to do" list
>for a while now .. guess I should get it done. We have a regression test
>directory specifically set aside for this but we haven't checked in
>enough test cases yet.  This will get sorted out fairly soon since we
>don't want to have any issues going forward.
>
>As far as I know, the current LLVM bytecode reader will read 1.0, 1.1,
>1.2 and 1.3 format bytecode files. If you find otherwise, please submit
>a bug with your bytecode test case. I will fix any problems and add the
>test case to the regression tests.
>
> >
> > Hope this helps!
>
>Immensely. Its difficult documenting this in a vacuum, checking it with
>the code, looking at it with a hex dumper and "checking it twice". This
>kind of feedback is very valuable and I *really* appreciate it.
>
> > Thanks again to Reid for the great docs and to Chris and crew for the 
> whole
> > LLVM shebang.
>
>Sorry the docs weren't "perfect" for you. Hopefully they're better now.

Yes!  Thanks for all the clarifications and corrections!

Regards,

-- Robert.

Robert Mykland               Voice: (831) 462-6725
Founder/CTO                   Ascenium Corporation