[LLVMdev] Re: Bytecodes & docs

Tue Aug 17 17:25:18 PDT 2004

More feedback inline ...

Robert Mykland wrote:

> Reid,
> 
> Thanks for the detailed feedback.

Sure .. devil's in the details :)

> A value of zero now means zero literal for everything except labels, 
> right?  

Hmm. Not quite sure what you mean here. Zero values are used in quite a few 
places for various purposes. For example, the zlist will write a zero byte to 
terminate the list. In general a zero byte is only used to terminate some 
value. Zero corresponds to the "Null" type plane. We never emit that plane, nor 
any values of type "Null", so you won't see zero as the type index for any value.
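
To make the terminator case concrete, here is a minimal Python sketch of 
reading a zero-terminated list. It assumes, purely for illustration, that each 
entry is a variable-length unsigned integer (seven payload bits per byte, 
low-order bits first, high bit set when more bytes follow). The names read_vbr 
and read_zlist are hypothetical and are not taken from the LLVM reader.

```python
def read_vbr(data, pos):
    """Decode one unsigned value starting at pos; return (value, next_pos)."""
    value = 0
    shift = 0
    while True:
        byte = data[pos]
        pos += 1
        value |= (byte & 0x7F) << shift  # low 7 bits carry the payload
        shift += 7
        if not (byte & 0x80):            # high bit clear: last byte of value
            return value, pos


def read_zlist(data, pos=0):
    """Collect entries until a zero value is read; return (entries, next_pos)."""
    entries = []
    while True:
        value, pos = read_vbr(data, pos)
        if value == 0:                   # the zero entry terminates the list
            return entries, pos
        entries.append(value)
```

Under these assumptions a zero can never be a list member, which is why it is 
free to act as the terminator.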

> There is kind of a vague reference to this in the 1.0 -> 1.1 
> section I believe. 

The Version 1.0 and 1.1 bytecode formats were identical (bytecode format 1). 
You are probably referring to the differences between 1.1 and 1.2 in the 
"Explicit Primitive Zeros" section. That section covers only the encoding of 
constant zero or null values for the primitive types. The IR changed in 
1.2 to have a constant "null" value for each primitive type and for pointer 
types. Consequently, there was no need to write these to the bytecode file any 
more. In some cases this saved huge amounts of bytecode, because a zero 
initializer for a large array of primitive type used to emit a zero 
initializer for every element of the array. This is no longer done.

> You might want to make this clearer when talking 
> about values in the body of the document.

Could you suggest how?  I'm a little fuzzy on what you're getting at here.

> --> A comment on this: if a value of zero were never used for labels, 
> that would make me happy, because then my code could replace references 
> to zero with literal zero always and always subtract 1 from the value if 
> not zero to index into my type/value table.

I'm not sure what you mean by "if a value of zero were never used for labels". 
Are you referring to the type id (value=12), the slot number for the label 
(should only be one label with that slot number per function), or something 
else? Note that label values are not, per se, written to the bytecode. Instead 
we just give the label's name to the corresponding instruction in the symbol table.

> After reading through the upgrade sections, it seemed to me that there 
> are several things mentioned there that ought to also be mentioned in 
> the body of the bytecode document.  I admit I'm lazy, so I usually only 
> read upgrade sections of a doc when I'm busy upgrading to the next 
> version.  Here's a vote for making the docs more friendly to lazy 
> skimmers like myself.

Could you provide a list of the "several things mentioned there that ought to 
be mentioned in the body of the bytecode document"? I'm unclear on the 
specifics you're referring to. I'm happy to make this "skimmer friendly" but 
just not sure what you're getting at.

> 
> More comments below:
> 
> At 08:56 PM 8/16/2004, Robert Mykland wrote:
> 
>> From: Reid Spencer <reid at x10sys.com>
>>
>> On Mon, 2004-08-16 at 13:49, Robert Mykland wrote:
>> >
>> > I thought I should send this to the list in case anyone else is 
>> > struggling to interpret bytecode files with the new docs.
>> >
>> > (1)     First a bug I already mentioned to Reid.  Unlike the other new
>> > module headers module 0x01 still uses the old 32-bit and 32-bit format
>> > instead of the new 5-bit and 27-bit format.  Thus the first module in the
>> > file will be 0x00000001 followed by a 32-bit size.
>>
>> This is a doc bug. It's fixed per my previous email.
> 
> Reid explained to me in separate email that you need the extra five bits 
> in this particular case to support large executable files.  This makes a 
> lot of sense.
> 
> I had an idea about this.  If the size parameters represented 32-bit 
> words instead of bytes (since they're aligned anyway) this would allow 
> you to use the same format even for the first one.

I've thought about that, but the problem is that the blocks are not aligned any 
more. I'm not worried about bytecode files that exceed 2^32 bytes.

> Another way to save space with headers would be to make the combined 
> ID/size field a variable byte encoded field.

I thought of that initially. Unfortunately, the "size" field is written at 
the end of the block so the number of bytes for the size has to be constant so 
we can fix it up once the block size is known.  The file format has to make 
sense for the writer too :)
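
A rough Python sketch of that fix-up, not the actual writer code: reserve a 
fixed-width header, write the body, then go back and patch the header once the 
size is known. The function name write_block is hypothetical, and the bit 
layout (id in the low 5 bits, size in the high 27 bits, little-endian) is an 
assumption made for the example.

```python
import struct

ID_BITS = 5     # width of the block id field in the 5-bit/27-bit header
SIZE_BITS = 27  # width of the block size field


def write_block(out, block_id, body):
    """Append one block to `out` (a bytearray), patching its header last."""
    if block_id >= (1 << ID_BITS):
        raise ValueError("block id does not fit in 5 bits")
    header_at = len(out)
    out.extend(b"\x00\x00\x00\x00")   # fixed-width placeholder for the header
    out.extend(body)                  # a real writer would stream the body out
    size = len(out) - header_at - 4   # known only after the body is written
    if size >= (1 << SIZE_BITS):
        raise ValueError("block size does not fit in 27 bits")
    # Assumed layout: id in the low 5 bits, size in the high 27 bits,
    # stored little-endian. The placeholder is overwritten in place.
    struct.pack_into("<I", out, header_at, (size << ID_BITS) | block_id)
    return out
```

Because the placeholder is always four bytes, the fix-up never has to move the 
body. A variable-length field would need the size before the body is written.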

>> > (3) In Reid's documentation, his "opcode" link is bad.  His doc does
>> > not yet contain an opcode section.  Presumably this would contain the info
>> > from the include file Instruction.def.
>>
>> This is also a doc bug. It's fixed by just referencing the
>> Instruction.def file on the cvsweb which will always contain the correct
>> list of instruction opcode values for the latest release. Note that that
>> might not be correct for *your* release :)
> 
> This is not a good fix for people like me who may be a few versions back 
> from the latest release from time to time.  This info should really be 
> duplicated in the body of the doc.

Okay, I see your point. If you'd care to submit a patch, I'll add it in. :) 
Otherwise, this will have to wait a bit until I can spend some time at it (I 
have to figure out which instructions go with which versions).

>> Because types and values got disassociated in 1.3 (Type no longer
>> derives from Value), they are handled in separate data structures
>> internally and written to the symbol table block separately. The
>> documentation is only slightly deficient in this area as it didn't
>> correctly identify the type of lists used for the types and value
>> planes.  The documentation has been updated to correctly reflect the
>> nature of the lists.
>>
>> Robert: if you could, please review:
>>
>> http://llvm.x10sys.com/llvm/docs/BytecodeFormat.html#symtab
>>
>> and let me know if it makes sense now.
> 
> IMHO still not clear.  The problem here is that too many things in this 
> documentation are referred to as "slot number".  I know that's how the 
> comments read in the code, but it's unclear.  The term is used both to 
> describe an index into a type slot and the slot itself.  We need some 
> clearer nomenclature here.
> 
> My suggestion would be to drop the word "slot" entirely and go with 
> something like "type index" and "value index" for these two kinds of 
> things.
> 
> Also, painful I know, I would split "symbol table entry" up into two 
> sections, one for types, one for actual symbols, just so you can make 
> the description of that first field in each crystal clear.
> 
> My 2c.

Okay. All are good suggestions. I'll look into it.

>> > (5) Labels used to have their own type.  If this is still the case, it's
>> > not discussed in Reid's document.  It looks like the new type slot for
>> > label is 12, the same as raw function.  Presumably this would be the 
>> > secret type slot between the last primitive type (11) and the new start
>> > of the defined types table (13).
>>
>> This is probably a result of the "Type != Value" change that happened in
>> 1.3. In 1.2, we had (in Type.h):
> 
> 
> Yes.  This was one of those items that was buried back in the upgrade 
> section.  Lazy skimmers like myself will get confused and ask about this.

Should I move the differences section to the front of the document? Would that 
help?

>> Sorry if this causes you grief, but it's important for the design of our
>> internal data structures that the type ids be contiguous. Hopefully this
>> is the last change in this area for a long time. :)
> 
> 
> No problem at all.  This was a 5 second change in my code.

Good. Glad the impact wasn't too bad. It's much worse in our reader/writer!

>> > It might be a really good bug hunting expedition before each release to
>> > decode a few bytecode files by hand.  In my experience this is the 
>> > only way to get physical protocols like this right.
>>
>> Agreed, but it's pretty labor intensive.
> 
> I'll sign up for some of this.  I have a vested interest and 
> independently written code that reads bytecodes and depends on this 
> physical protocol.  Let me know when you make a significant change and I 
> can see whether my bytecode reader chokes on anything.  Unfortunately, 
> so far, its coverage is limited.  

Just watch the list. I prominently post on LLVMdev every bytecode change 
because the potential impact to everyone can be quite high.  I hate being the 
reason someone else's work got choked :)

> Also, over time it will become immune 
> to checking some problems (for example, it now always rounds block sizes 
> to the next nearest 32-bit boundary to cover the size of padding in all 
> cases).

Careful! In version 4 of the bytecode (to be released with LLVM 1.4), the 
blocks are no longer aligned. You're fine for version 3, but please make a note 
of this change for version 4 bytecode files.
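
For an independent reader, the change could be handled with something like the 
following Python sketch. The helper name padded_block_size is hypothetical, 
and treating every version before 4 as aligned is an assumption; only 
version 3 is confirmed above.

```python
def padded_block_size(size, version):
    """Return the number of bytes a block body occupies, including padding."""
    if version < 4:
        # Assumed: versions before 4 pad each block to a 32-bit boundary.
        return (size + 3) & ~3
    # Version 4 blocks are no longer aligned, so no padding is added.
    return size
```

Keying the rounding on the bytecode version keeps one reader usable for both 
old and new files.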

> FYI, I found all the stuff mentioned here by compiling the simplest 
> possible C program:
> 
> int main( void )
> {
>         return( 0 );
> }
> 
> Test cases like this don't take too much time to look through by hand.

That's true, but they also don't produce interesting bytecode files that have 
corner cases that make for a good test. The one you sent us with a huge "main" 
will be added to our regression tests because it showed up the problem with 
alignment and some other issues.

Thanks for all your feedback, Robert. This really helps.

Reid.