[LLVMdev] More Encoding Ideas

Fri Aug 20 17:55:02 PDT 2004

At 05:09 PM 8/20/2004, you wrote:
>On Fri, 20 Aug 2004, Reid Spencer wrote:
> > > defined would be almost always stored in one byte instead of the present
> > > usual two.
> >
> > So, if I get you correctly, you're advocating the creation of a
> > Type::CharTyID in the TypeID enumeration that is always written as a
> > single byte? Note that right now all ASCII values ( <128 ) will be
> > written as a single byte for UByteTyID but for SByteTyID (often the
> > default from FE compilers like GCC), you're right, they'll take two
> > bytes if the value > 63.  Or are you saying that we should always write
> > UByteTyID and SByteTyID as a single byte?
> >
> > Long term, LLVM's distinction between signed and unsigned will go away.
> > Talk to Chris about that. :)
>
>If you're interested in the plans, they are described in some detail here:
>http://nondot.org/sabre/LLVMNotes/TypeSystemChanges.txt
>
>Note that there is no concrete timeline for this to happen, it basically
>depends on when someone is ambitious enough to start working on it.
>
>In any case, both signed and unsigned 8-bit constants can be written out
>in a single byte.  Again, do you think it's worth special casing this
>though?  Considering that we handle 8-bit strings specially already, there
>are not a ton of 8-bit constants with value >= 128.

I'd rather that they not be treated specially.  If char defaulted to 
unsigned char, there would be little reason to create this special case.
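
For reference, the one-byte versus two-byte behaviour discussed above can be sketched as follows. This is only an illustration, assuming a variable-bit-rate (VBR) layout with seven payload bits per byte and, for signed values, the sign kept in bit 0 of the shifted magnitude. The function names encodeVbr and encodeSignedVbr are made up for this example and are not the actual bytecode writer API.

```cpp
#include <cstdint>
#include <vector>

// Unsigned VBR (assumed layout): 7 payload bits per byte, low-order
// group first; the high bit of a byte is set when more bytes follow.
std::vector<uint8_t> encodeVbr(uint32_t value) {
    std::vector<uint8_t> out;
    while (value >= 0x80) {
        out.push_back(static_cast<uint8_t>((value & 0x7F) | 0x80));
        value >>= 7;
    }
    out.push_back(static_cast<uint8_t>(value));
    return out;
}

// Signed VBR (assumed layout): the magnitude is shifted left one bit,
// the sign goes in bit 0, and the result is written as unsigned VBR.
std::vector<uint8_t> encodeSignedVbr(int32_t value) {
    const bool negative = value < 0;
    // Negate in unsigned arithmetic so INT32_MIN does not overflow.
    const uint32_t magnitude = negative ? 0u - static_cast<uint32_t>(value)
                                        : static_cast<uint32_t>(value);
    return encodeVbr((magnitude << 1) | (negative ? 1u : 0u));
}
```

Under these assumptions an unsigned value fits in one byte up to 127, while a signed value loses one payload bit to the sign and spills into a second byte once it exceeds 63.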

> > > 2) I think it would be a big file size and processing speed win to have
> > > implied pointer types for every literal type.  This would save a
> > > tremendous amount of space in the global type table and other places
> > > where pointer types are constantly being defined.  So the primitive
> > > types list would change to:
> > >
> > > 0       void
> > > 1       void* (implied)
>
>This is a very interesting idea, particularly for languages like C++ that
>have a ton of types.  Before making this change, I would want to see some
>numbers though.  In particular, I don't think that types typically take up
>a large amount of the .bc file size: most of it is instructions.
>
>Are you seeing other cases?

No.  This would only save a bit less than two bytes per primitive and 
defined type.  Maybe a few hundred bytes in a large LLVM file.  Not a big 
savings, but a savings.  The thing I like is that along with the size 
savings it appears to make the encode/decode simpler and quicker if 
anything.  So good news all around.

> > > This approach would have the added advantage of being able to check to
> > > see whether anything is a pointer type by checking bit 0 (1 = yes) and
> > > deriving its dereferenced type (just subtract 1).
>
>I don't think this is a big win, the .bc reader doesn't have to do much of
>this.

I know my reader does this.  I'm not really sure how much time it spends 
doing it.  My little code generator spends a lot of time going back and 
forth between pointers and literal values when turning certain kinds of 
memory operations into data movement in the Ascenium array.
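
To make the bit-0 check concrete, here is a rough sketch of the implied-pointer numbering proposed in point 2. Only the first two slots (0 = void, 1 = void*) come from the list above; the TypeSlot alias and the helper names are hypothetical and exist only for this example.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical slot numbering: an even slot is a literal type and the
// odd slot right after it is the implied pointer to that type.
using TypeSlot = uint32_t;

const TypeSlot kVoid = 0;     // from the proposed list: 0 = void
const TypeSlot kVoidPtr = 1;  // from the proposed list: 1 = void* (implied)

// A slot names an implied pointer type when bit 0 is set.
inline bool isImpliedPointer(TypeSlot slot) {
    return (slot & 1u) != 0;
}

// The implied pointer to a literal type lives in the next slot.
inline TypeSlot pointerTo(TypeSlot literalSlot) {
    assert(!isImpliedPointer(literalSlot) && "expected a literal type slot");
    return literalSlot + 1;
}

// Dereferencing an implied pointer is just "subtract 1".
inline TypeSlot pointeeOf(TypeSlot pointerSlot) {
    assert(isImpliedPointer(pointerSlot) && "expected an implied pointer slot");
    return pointerSlot - 1;
}
```

In this sketch only one level of indirection is implied, so a pointer to a pointer would presumably still need an explicit entry in the type table.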

> > > 3) Have the value index for labels start at 1, just like nonzero values
> > > of everything else do.  This just makes the encode/decode algorithm
> > > simpler and I doubt it would cost anything in file size.  I made this
> > > suggestion a few emails back, hopefully in a clearer form here.
> >
> > Like I replied, we don't store labels as values in LLVM. Labels are just
> > the names of basic blocks. Those names are stored in the function level
> > symbol
>
>I think that Robert's point is that this would remove a special case from
>the code (which is good).  I'm indifferent about the change: if some other
>changes are made to the .bc file format, this could go in as well.

Cool.

> > > 4) Can files have multiple 0x01 headers?  I've never seen more than
> > > one.  If not, ditch these four bytes of unnecessary space per file.
> >
> > I think the original plan was to have multiple modules in them but this
> > seems to have gone by the wayside. The result of linking two (or more)
> > modules is a single module so except in some really bizarre corner cases
> > the need for multiple modules would go away. I suppose we could get rid
> > of the block id field for the file. I'll give this some thought and see
> > if Chris has any objections.
>
>I don't have any problem with removing it.

Cool. Before you chop, remember debug libraries.

> > Long term, I intend to write some kind of bytecode archive utility
> > similar to JAR files that contains multiple bytecode files, an index,
> > and the whole thing
>
>Sounds like a cool thing.  If you did this, make sure that llvm-nm could
>read the files (of course), and, if/when you do this, you could make the
>interface be llvm-ar (which was never finished).

Seconded!

Regards,

-- Robert.

Robert Mykland               Voice: (831) 462-6725
Founder/CTO                   Ascenium Corporation