[LLVMdev] More Encoding Ideas
Reid Spencer
reid at x10sys.com
Fri Aug 20 14:05:04 PDT 2004
Robert Mykland wrote:
> Dear Chris and Reid:
Hi Robert.
>
> Some other random ideas I've had as I've been sifting through the new
> bytecode format. Please let me know what you think.
>
> 1) ANSI C allows for char to default to unsigned char. This is I guess
> not how it normally is in GCC. If char defaulted to unsigned char
> several things would be possible. Single char constants that are
> defined would be almost always stored in one byte instead of the present
> usual two.
So, if I get you correctly, you're advocating the creation of a Type::CharTyID
in the TypeID enumeration that is always written as a single byte? Note that
right now all ASCII values ( <128 ) will be written as a single byte for
UByteTyID but for SByteTyID (often the default from FE compilers like GCC),
you're right, they'll take two bytes if the value > 63. Or are you saying that
we should always write UByteTyID and SByteTyID as a single byte?
Long term, LLVM's distinction between signed and unsigned will go away. Talk to
Chris about that. :)
Also this would allow string constants to be stored in the
> constant table in the regular fashion without wasting bytes.
I don't think that happens today. We handle string constants (as opposed to
character constants) separately and they are always written as a length
followed by the characters in the string.
This would
> prevent the need to do those expensive linear searches through the type
> slot list in order to match a character constant with its type.
hmm .. not clear on what you mean here. What "linear searches" ? THe slot
number indexes into the vector of types directly. I believe its O(1), not O(n).
Am I missing something here?
>
> 1a) If it's not feasible to make char default to unsigned, perhaps it
> would be possible to put all string constants at type slot zero to
> eliminate the linear searching. I realize this would somewhat violate
> LLVM's strict typing rules, but since these are all constants their
> strict type could always be derived if needed. Also, it would save
> space by eliminating the need to create a proliferating number of char
> array types of various lengths.
Again, I'm not sure what you're getting at here. Have a look at the function
outputConstantStrings in lib/Bytecode/Writer/Writer.cpp. Strings currently
occupy the "void" type plane, plane 0 (which is a bit of a hack), but we write
out the type slot number (index), then the character data, then a terminating
null (0) byte. When the reader reads the constant strings, it reads the slot
number and then calls getType(slot) which just indexes into the appropriate
vector of strings. If you look at getType() in lib/Bytecode/Reader/Reader.cpp I
think you'll find there's no O(n) searching there, just O(1) indexing.
>
> 1b) Failing this, you should at least store the type with each constant
> string to avoid the linear searches. This solution would add space, but
> save processing time, especially with large files with extensive type
> lists.
Type type slot number is stored with each constant string, as described above.
>
> 2) I think it would be a big file size and processing speed win to have
> implied pointer types for every literal type. This would save a
> tremendous amount of space in the global type table and other places
> where pointer types are constantly being defined. So the primitive
> types list would change to:
>
> 0 void
> 1 void* (implied)
> 2 bool
> 3 bool* (implied)
> 4 ubyte
> 5 ubyte* (implied)
> 6 sbyte
> 7 sbyte* (implied)
> 8 ushort
> 9 ushort* (implied)
> etc.
>
> This approach would have the added advantage of being able to check to
> see whether anything is a pointer type by checking bit 0 (1 = yes) and
> deriving its dereferenced type (just subtract 1).
Interesting idea. However, I'm not convinced it saves us much. Most high level
languages will have lots of pointers to non-primitive types for which this
method won't work. Can you provide me with a bytecode file where you can show
that the bytecode is bloated by lots of pointers to primitive types?
> 3) Have the value index for labels start at 1, just like nonzero values
> of everything else does. This just makes the encode/decode algorithm
> simpler and I doubt it would cost anything in file size. I made this
> suggestion a few emails back, hopefully in a clearer form here.
Like I replied, we don't store labels as values in LLVM. Labels are just the
names of basic blocks. Those names are stored in the function level symbol
table but there's no value slot indices for labels. Perhaps I'm missing
something here? I'm still not exactly sure what you mean by "the value index
for labels". Do you mean its slot number? If so, I think you're mistaken on the
file format or the doc is misleading you. Please let me know which so I can fix
the doc. :)
> 4) Can files have multiple 0x01 headers? I've never seen more than
> one. If not, ditch this four bytes of unnecessary space per file.
I think the original plan was to have multiple modules in them but this seems
to have gone by the wayside. The result of linking two (or more) modules is a
single module so except in some really bizare corner cases the need for
multiple modules would go away. I suppose we could get rid of the block id
field for the file. I'll give this some thought and see if Chris has any
objections.
Long term, I intend to write some kind of bytecode archive utility similar to
JAR files that contains multiple bytecode files, an index, and the whole thing
is compressed via bzip2. This would be the normal way that libraries compiled
with LLVM would be distributed. Note that LLVM bytecode compreses with bzip2
around 50%. We might even add a global symbol index (optionally) so the things
can be linked quickly without examining each bytecode file in the archive. This
is on my "TODO" list but at a lower priority (for now).
>
> 5) Don't write the compaction table for a function if there are no
> entries. All my simple examples have empty compaction tables that use
> up 8 bytes per function. This would save space.
Hmm. That's not supposed to happen. Have you got a bytecode and source file
that produces this? The algorithm in bcwriter for deciding whether to emit the
compaction table is pretty complicated. It should avoid the compaction table
any time there's no advantage to it space wise. The compaction table's *sole*
purpose is to save bytecode space and its BAD if it gets used incorrectly.
>
> I hope you find these suggestions helpful.
Absolutely! You have some interesting ideas here.
> I'm committed to making LLVM
> bytecode as compact and as quick to encode/decode as possible.
Thanks, we appreciate that a lot. Its high on our agenda too. Everything we can
do to make the bytecode faster/smaller has enormous impacts downstream
(considering linking 1000s of bc files).
Reid.
More information about the llvm-dev
mailing list