[LLVMdev] More Encoding Ideas

Fri Aug 20 17:39:35 PDT 2004

At 02:05 PM 8/20/2004, you wrote:
>Robert Mykland wrote:
>>Dear Chris and Reid:
>
>Hi Robert.
>
>>Some other random ideas I've had as I've been sifting through the new 
>>bytecode format.  Please let me know what you think.
>>1) ANSI C allows for char to default to unsigned char.  This is I guess 
>>not how it normally is in GCC.  If char defaulted to unsigned char 
>>several things would be possible.  Single char constants that are defined 
>>would be almost always stored in one byte instead of the present usual two.
>
>So, if I get you correctly, you're advocating the creation of a 
>Type::CharTyID in the TypeID enumeration that is always written as a 
>single byte? Note that right now all ASCII values ( <128 ) will be written 
>as a single byte for UByteTyID but for SByteTyID (often the default from 
>FE compilers like GCC), you're right, they'll take two bytes if the 
>value > 63.  Or are you saying that we should always write UByteTyID and 
>SByteTyID as a single byte?
>
>Long term, LLVM's distinction between signed and unsigned will go away. 
>Talk to Chris about that. :)

No, I didn't make myself clear.  When you do a string constant like:

printf( "hello, world!" );

...its is defined as type char*.  The ANSI standard for the C programming 
language allows for compilers to either default char to be the same as 
unsigned char OR char to be the same as signed char, as they appear to be 
currently in llvm-gcc.  Now this may be very difficult to change in gcc, I 
don't know, but if it could change, I'm noticing that it would make the 
bytecode files cleaner.

Even though I now understand how to decode these without searching the type 
table, it would be great if these constants could be treated like any other 
constants without a size penalty.  There would be little size penalty if 
char defaulted to unsigned char.

>   Also this would allow string constants to be stored in the
>>constant table in the regular fashion without wasting bytes.
>
>I don't think that happens today. We handle string constants (as opposed 
>to character constants) separately and they are always written as a length 
>followed by the characters in the string.
>
>
>This would
>>prevent the need to do those expensive linear searches through the type 
>>slot list in order to match a character constant with its type.
>
>hmm .. not clear on what you mean here. What "linear searches" ? THe slot 
>number indexes into the vector of types directly. I believe its O(1), not 
>O(n). Am I missing something here?

D'OH!  Never mind.  On the example I was looking at, the size of the string 
happened to match the number of the type slot and I got the two confused.

>>1a) If it's not feasible to make char default to unsigned, perhaps it 
>>would be possible to put all string constants at type slot zero to 
>>eliminate the linear searching.  I realize this would somewhat violate 
>>LLVM's strict typing rules, but since these are all constants their 
>>strict type could always be derived if needed.  Also, it would save space 
>>by eliminating the need to create a proliferating number of char array 
>>types of various lengths.
>
>Again, I'm not sure what you're getting at here. Have a look at the 
>function outputConstantStrings in lib/Bytecode/Writer/Writer.cpp. Strings 
>currently occupy the "void" type plane, plane 0 (which is a bit of a 
>hack), but we write out the type slot number (index), then the character 
>data, then a terminating null (0) byte. When the reader reads the constant 
>strings, it reads the slot number and then calls getType(slot) which just 
>indexes into the appropriate vector of strings. If you look at getType() 
>in lib/Bytecode/Reader/Reader.cpp I think you'll find there's no O(n) 
>searching there, just O(1) indexing.

Yep, I understand this now.

>>1b) Failing this, you should at least store the type with each constant 
>>string to avoid the linear searches.  This solution would add space, but 
>>save processing time, especially with large files with extensive type lists.
>
>Type type slot number is stored with each constant string, as described above.

Yep.

>>2) I think it would be a big file size and processing speed win to have 
>>implied pointer types for every literal type.  This would save a 
>>tremendous amount of space in the global type table and other places 
>>where pointer types are constantly being defined.  So the primitive types 
>>list would change to:
>>0       void
>>1       void* (implied)
>>2       bool
>>3       bool* (implied)
>>4       ubyte
>>5       ubyte* (implied)
>>6       sbyte
>>7       sbyte* (implied)
>>8       ushort
>>9       ushort* (implied)
>>etc.
>>This approach would have the added advantage of being able to check to 
>>see whether anything is a pointer type by checking bit 0 (1 = yes) and 
>>deriving its dereferenced type (just subtract 1).
>
>Interesting idea. However, I'm not convinced it saves us much. Most high 
>level languages will have lots of pointers to non-primitive types for 
>which this method won't work. Can you provide me with a bytecode file 
>where you can show that the bytecode is bloated by lots of pointers to 
>primitive types?

No, you're not getting the point.  The bytecode is bloated by pointers to 
all types.  Every time a type is defined pretty much a pointer is defined 
with it already, and since the pointer type is not implied by the 
definition of the literal, we waste two or more bytes whenever we define a 
type.  I'm advocating that every time we define a type an implied pointer 
type to it is also created.  We'll save two or more bytes for every type 
defined, including all function types.  That's hundreds of bytes for a big 
LLVM file.

>>3) Have the value index for labels start at 1, just like nonzero values 
>>of everything else does.  This just makes the encode/decode algorithm 
>>simpler and I doubt it would cost anything in file size.  I made this 
>>suggestion a few emails back, hopefully in a clearer form here.
>
>Like I replied, we don't store labels as values in LLVM. Labels are just 
>the names of basic blocks. Those names are stored in the function level 
>symbol table but there's no value slot indices for labels.  Perhaps I'm 
>missing something here? I'm still not exactly sure what you mean by "the 
>value index for labels". Do you mean its slot number? If so, I think 
>you're mistaken on the file format or the doc is misleading you. Please 
>let me know which so I can fix the doc. :)

The symbol table refers to a value index for each label.  This value index 
starts counting at zero, instead of one like for every other data type.  So 
my function that retrieves a thing based on its value index has to have a 
special case in it for labels.  That's what I'm referring to.

>>4) Can files have multiple 0x01 headers?  I've never seen more than 
>>one.  If not, ditch this four bytes of unnecessary space per file.
>
>I think the original plan was to have multiple modules in them but this 
>seems to have gone by the wayside. The result of linking two (or more) 
>modules is a single module so except in some really bizare corner cases 
>the need for multiple modules would go away. I suppose we could get rid of 
>the block id field for the file. I'll give this some thought and see if 
>Chris has any objections.
>
>Long term, I intend to write some kind of bytecode archive utility similar 
>to JAR files that contains multiple bytecode files, an index, and the 
>whole thing is compressed via bzip2. This would be the normal way that 
>libraries compiled with LLVM would be distributed. Note that LLVM bytecode 
>compreses with bzip2 around 50%.  We might even add a global symbol index 
>(optionally) so the things can be linked quickly without examining each 
>bytecode file in the archive. This is on my "TODO" list but at a lower 
>priority (for now).

I can see how multiple 0x01 modules in a single LLVM file would be handy if 
you wanted to easily maintain seperate module name spaces in debug versions 
of libraries.

Gosh, but at the very least, since it's a special case anyway, we could 
shrink this field down to a byte.

Come to think of it, you could still have multiple seperate LLVM modules in 
a file without this marker.  If there were still bytes in the file after 
the last byte specified by the module size, you'd be in another 
module.  Debug libraries is such a rare case that this wouldn't be too big 
a deal.

>>5) Don't write the compaction table for a function if there are no 
>>entries.  All my simple examples have empty compaction tables that use up 
>>8 bytes per function.  This would save space.
>
>Hmm. That's not supposed to happen. Have you got a bytecode and source 
>file that produces this? The algorithm in bcwriter for deciding whether to 
>emit the compaction table is pretty complicated. It should avoid the 
>compaction table any time there's no advantage to it space wise. The 
>compaction table's *sole* purpose is to save bytecode space and its BAD if 
>it gets used incorrectly.

It happens all the time!  I guess this is a bug.

Thanks for everything!

-- Robert.

Robert Mykland               Voice: (831) 462-6725
Founder/CTO                   Ascenium Corporation