[LLVMdev] More Encoding Ideas

Fri Aug 20 12:52:45 PDT 2004

Dear Chris and Reid:

Some other random ideas I've had as I've been sifting through the new 
bytecode format.  Please let me know what you think.

1) ANSI C leaves the signedness of plain char up to the implementation, so 
char could default to unsigned char.  I gather this is not what GCC 
normally does.  If char defaulted to unsigned char, several things would be 
possible.  Single char constants would almost always be stored in one byte 
instead of the usual two at present.  String constants could also be stored 
in the constant table in the regular fashion without wasting bytes.  This 
would remove the need for those expensive linear searches through the type 
slot list to match a character constant with its type.

1a) If it's not feasible to make char default to unsigned, perhaps it would 
be possible to put all string constants at type slot zero to eliminate the 
linear searching.  I realize this would somewhat violate LLVM's strict 
typing rules, but since these are all constants their strict type could 
always be derived if needed.  It would also save space by eliminating the 
need to create an ever-growing number of char array types of various 
lengths.

1b) Failing this, you should at least store the type with each constant 
string to avoid the linear searches.  This solution would add space but 
save processing time, especially for large files with extensive type lists.
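
To make the trade-off in 1b concrete, here is a toy model in Python.  It is 
not the real bytecode reader: the contents of the type list, the string form 
of the types, and the function names are all invented for illustration.  It 
only contrasts scanning the type list for the matching array type with 
reading a slot number stored alongside the string.

```python
# Toy type slot list; the entries and their string form are made up.
TYPE_SLOTS = ["int", "[4 x sbyte]", "float*", "[12 x sbyte]", "[6 x sbyte]"]


def find_slot_by_search(text):
    """Linear search: scan the type list for the array type matching the
    string's length.  Cost grows with the number of types in the list."""
    wanted = "[%d x sbyte]" % len(text)
    for slot, ty in enumerate(TYPE_SLOTS):
        if ty == wanted:
            return slot
    raise KeyError(wanted)


def find_slot_stored(record):
    """Stored type: the constant carries its own slot number, so no scan
    is needed, at the price of the extra slot number in the file."""
    slot, _text = record
    return slot
```

Both find_slot_by_search(b"hello\0") and find_slot_stored((4, b"hello\0")) 
land on the same slot here, but only the first has to walk the list.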

2) I think it would be a big win for both file size and processing speed to 
have implied pointer types for every literal type.  This would save a 
tremendous amount of space in the global type table and other places where 
pointer types are constantly being defined.  So the primitive types list 
would change to:

0       void
1       void* (implied)
2       bool
3       bool* (implied)
4       ubyte
5       ubyte* (implied)
6       sbyte
7       sbyte* (implied)
8       ushort
9       ushort* (implied)
etc.

This approach would have the added advantage that you could tell whether 
any type in this list is a pointer type by checking bit 0 (1 = yes), and 
derive its dereferenced type by subtracting 1.
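
A minimal sketch of that bit trick in Python, assuming only the five 
primitive types spelled out in the list above (the rest is elided there, so 
it is left out here).  The names SLOT_NAMES, is_implied_pointer, 
pointee_slot and pointer_slot are made up for illustration and are not part 
of the bytecode format.

```python
# Illustrative only: the five primitive types spelled out in the list above.
PRIMITIVES = ["void", "bool", "ubyte", "sbyte", "ushort"]

# Even slots hold the type itself, odd slots hold its implied pointer.
SLOT_NAMES = []
for name in PRIMITIVES:
    SLOT_NAMES.append(name)
    SLOT_NAMES.append(name + "*")


def is_implied_pointer(slot):
    """Bit 0 set means the slot is an implied pointer type."""
    return (slot & 1) == 1


def pointee_slot(slot):
    """Dereferenced type of an implied pointer slot: just subtract 1."""
    if not is_implied_pointer(slot):
        raise ValueError("slot %d is not an implied pointer type" % slot)
    return slot - 1


def pointer_slot(slot):
    """Implied pointer to a primitive (even) slot: just add 1."""
    if is_implied_pointer(slot):
        raise ValueError("slot %d is already a pointer type" % slot)
    return slot + 1
```

For example, slot 7 is sbyte*, and pointee_slot(7) gives slot 6, sbyte.  The 
parity check presumably only holds inside the primitive range; derived types 
further up the table would still need a real lookup.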

3) Have the value index for labels start at 1, just like nonzero values of 
everything else do.  This just makes the encode/decode algorithm simpler 
and I doubt it would cost anything in file size.  I made this suggestion a 
few emails back; hopefully it is clearer in this form.

4) Can files have multiple 0x01 headers?  I've never seen more than 
one.  If not, ditch these four bytes of unnecessary space per file.

5) Don't write the compaction table for a function if it has no 
entries.  All my simple examples have empty compaction tables that use up 8 
bytes per function, so omitting them would save that space.
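
A sketch of suggestion 5 in Python, under loudly stated assumptions: the 
block tag values, the 8-byte block header (a 4-byte tag plus a 4-byte 
length), and the function names are hypothetical, chosen only so that an 
empty block costs the 8 bytes mentioned above.  The real writer may lay 
things out differently.

```python
import struct

# Hypothetical block tags, not the real bytecode identifiers.
COMPACTION_TABLE = 1
INSTRUCTIONS = 2


def block(tag, payload):
    """Assumed layout: 4-byte tag, 4-byte payload length, then payload."""
    return struct.pack("<II", tag, len(payload)) + payload


def write_function(compaction_entries, code, skip_empty=True):
    """Emit a function body, omitting the compaction table when empty."""
    out = b""
    if compaction_entries or not skip_empty:
        payload = b"".join(struct.pack("<I", e) for e in compaction_entries)
        out += block(COMPACTION_TABLE, payload)
    out += block(INSTRUCTIONS, code)
    return out
```

With no entries, write_function([], code) comes out 8 bytes shorter than 
write_function([], code, skip_empty=False).  This assumes the reader 
dispatches on the block tag, so a missing compaction block needs no special 
marker.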

I hope you find these suggestions helpful.  I'm committed to making LLVM 
bytecode as compact and as quick to encode/decode as possible.

Regards,

-- Robert.

Robert Mykland               Voice: (831) 462-6725
Founder/CTO                   Ascenium Corporation