[LLVMdev] More Encoding Ideas

Fri Aug 20 14:05:04 PDT 2004

Robert Mykland wrote:
> Dear Chris and Reid:

Hi Robert.

> 
> Some other random ideas I've had as I've been sifting through the new 
> bytecode format.  Please let me know what you think.
> 
> 1) ANSI C allows for char to default to unsigned char.  This is I guess 
> not how it normally is in GCC.  If char defaulted to unsigned char 
> several things would be possible.  Single char constants that are 
> defined would be almost always stored in one byte instead of the present 
> usual two.

So, if I get you correctly, you're advocating the creation of a Type::CharTyID 
in the TypeID enumeration that is always written as a single byte? Note that 
right now all ASCII values ( <128 ) will be written as a single byte for 
UByteTyID but for SByteTyID (often the default from FE compilers like GCC), 
you're right, they'll take two bytes if the value > 63.  Or are you saying that 
we should always write UByteTyID and SByteTyID as a single byte?

Long term, LLVM's distinction between signed and unsigned will go away. Talk to 
Chris about that. :)

   Also this would allow string constants to be stored in the
> constant table in the regular fashion without wasting bytes.  

I don't think that happens today. We handle string constants (as opposed to 
character constants) separately and they are always written as a length 
followed by the characters in the string.

This would
> prevent the need to do those expensive linear searches through the type 
> slot list in order to match a character constant with its type.

hmm .. not clear on what you mean here. What "linear searches" ? THe slot 
number indexes into the vector of types directly. I believe its O(1), not O(n). 
Am I missing something here?

> 
> 1a) If it's not feasible to make char default to unsigned, perhaps it 
> would be possible to put all string constants at type slot zero to 
> eliminate the linear searching.  I realize this would somewhat violate 
> LLVM's strict typing rules, but since these are all constants their 
> strict type could always be derived if needed.  Also, it would save 
> space by eliminating the need to create a proliferating number of char 
> array types of various lengths.

Again, I'm not sure what you're getting at here. Have a look at the function 
outputConstantStrings in lib/Bytecode/Writer/Writer.cpp. Strings currently 
occupy the "void" type plane, plane 0 (which is a bit of a hack), but we write 
out the type slot number (index), then the character data, then a terminating 
null (0) byte. When the reader reads the constant strings, it reads the slot 
number and then calls getType(slot) which just indexes into the appropriate 
vector of strings. If you look at getType() in lib/Bytecode/Reader/Reader.cpp I 
think you'll find there's no O(n) searching there, just O(1) indexing.

> 
> 1b) Failing this, you should at least store the type with each constant 
> string to avoid the linear searches.  This solution would add space, but 
> save processing time, especially with large files with extensive type 
> lists.

Type type slot number is stored with each constant string, as described above.

> 
> 2) I think it would be a big file size and processing speed win to have 
> implied pointer types for every literal type.  This would save a 
> tremendous amount of space in the global type table and other places 
> where pointer types are constantly being defined.  So the primitive 
> types list would change to:
> 
> 0       void
> 1       void* (implied)
> 2       bool
> 3       bool* (implied)
> 4       ubyte
> 5       ubyte* (implied)
> 6       sbyte
> 7       sbyte* (implied)
> 8       ushort
> 9       ushort* (implied)
> etc.
> 
> This approach would have the added advantage of being able to check to 
> see whether anything is a pointer type by checking bit 0 (1 = yes) and 
> deriving its dereferenced type (just subtract 1).

Interesting idea. However, I'm not convinced it saves us much. Most high level 
languages will have lots of pointers to non-primitive types for which this 
method won't work. Can you provide me with a bytecode file where you can show 
that the bytecode is bloated by lots of pointers to primitive types?

> 3) Have the value index for labels start at 1, just like nonzero values 
> of everything else does.  This just makes the encode/decode algorithm 
> simpler and I doubt it would cost anything in file size.  I made this 
> suggestion a few emails back, hopefully in a clearer form here.

Like I replied, we don't store labels as values in LLVM. Labels are just the 
names of basic blocks. Those names are stored in the function level symbol 
table but there's no value slot indices for labels.  Perhaps I'm missing 
something here? I'm still not exactly sure what you mean by "the value index 
for labels". Do you mean its slot number? If so, I think you're mistaken on the 
file format or the doc is misleading you. Please let me know which so I can fix 
the doc. :)

> 4) Can files have multiple 0x01 headers?  I've never seen more than 
> one.  If not, ditch this four bytes of unnecessary space per file.

I think the original plan was to have multiple modules in them but this seems 
to have gone by the wayside. The result of linking two (or more) modules is a 
single module so except in some really bizare corner cases the need for 
multiple modules would go away. I suppose we could get rid of the block id 
field for the file. I'll give this some thought and see if Chris has any 
objections.

Long term, I intend to write some kind of bytecode archive utility similar to 
JAR files that contains multiple bytecode files, an index, and the whole thing 
is compressed via bzip2. This would be the normal way that libraries compiled 
with LLVM would be distributed. Note that LLVM bytecode compreses with bzip2 
around 50%.  We might even add a global symbol index (optionally) so the things 
can be linked quickly without examining each bytecode file in the archive. This 
is on my "TODO" list but at a lower priority (for now).

> 
> 5) Don't write the compaction table for a function if there are no 
> entries.  All my simple examples have empty compaction tables that use 
> up 8 bytes per function.  This would save space.

Hmm. That's not supposed to happen. Have you got a bytecode and source file 
that produces this? The algorithm in bcwriter for deciding whether to emit the 
compaction table is pretty complicated. It should avoid the compaction table 
any time there's no advantage to it space wise. The compaction table's *sole* 
purpose is to save bytecode space and its BAD if it gets used incorrectly.

> 
> I hope you find these suggestions helpful.  

Absolutely! You have some interesting ideas here.

> I'm committed to making LLVM 
> bytecode as compact and as quick to encode/decode as possible.

Thanks, we appreciate that a lot. Its high on our agenda too. Everything we can 
do to make the bytecode faster/smaller has enormous impacts downstream 
(considering linking 1000s of bc files).

Reid.