[LLVMdev] More Encoding Ideas

Thu Aug 26 13:11:43 PDT 2004

On Thu, 2004-08-26 at 12:45, Robert Mykland wrote:
> At 09:37 PM 8/23/2004, you wrote:
> >On Mon, 2004-08-23 at 19:46, Robert Mykland wrote:
> > > At 06:43 PM 8/20/2004, Chris Lattner wrote:
> > > >I don't understand what you're getting at here.  You can change char to
> > > >default to unsigned right now with llvm-gcc -funsigned-char.  I don't
> > > >understand how that would change anything to be more useful though.
> > >
> > > Well, in the old days, char strings were handled just like any other kind
> > > of array of primitive types.
> >
> >And, they still are :)
> 
> No.  If you define an array of int, the various int values that initialize 
> the table are defined seperately and then their value indexes are used in 
> the definition of the int array, which appears under its created type in 
> the constants list.  Character strings are handled differently from this.

Okay, I see what you mean now. Yes, bytecode treats character arrays
(well, to be precise arrays of UByte or SByte) separately from arrays of
other primitive types.

> 
> 
> > > In that world, when char defaulted to signed
> > > char, most of the heavily used ASCII symbols took two bytes to
> > > encode.
> >
> >Um. What? ASCII is a 7-bit encoding. It defines values 0-127 which, even
> >with a sign bit is encoded into one byte. Recall that in the "old days"
> >computers had a parity bit as the 8th-bit because the memory failure
> >rates were so high (think vacuum tubes).
> 
> Actually, by "old days" I meant LLVM version 0.9.  In LLVM 0.9 they were 
> most often encoded as two bytes because most of the most commonly used 
> ASCII symbols are above 0x30.

Okay. LLVM 0.9 predates my involvement so I wouldn't know :)

> 
> > > Thus, (and I'm guessing here), you guys decided to treat char
> > > strings as a special case to save space in the bytecode file.
> >
> >Actually, LLVM doesn't really treat character strings specially EXCEPT
> >in the bcwriter and bcreader. There is no notion in LLVM of a "string",
> >just primitive types and arrays of them. It is up to the front end
> >compiler to define what it means by a "string". In the bytecode
> >libraries of LLVM, we chose to interpret "[n x ubyte]" and "[n x sbyte]"
> >as "strings" for reading and writing efficiency. They are, however,
> >still just arrays of one of the two primitive single-byte types.
> 
> Okay, but this discussion is about the physical protocol of the 
> bytecode.  That's what I'm referring to.

Okay.

> 
> > > If all pointer types are implied, not a problem to create them.  However,
> > > in larger files it may cost a little due to slightly larger type
> > > numbers.  I'm not sure about the tradeoff here, but I expect that implied
> > > pointers would still save more just because of pointers to function types.
> >
> >Pointers are used heavily in almost all languages. I can almost
> >guarantee that the "tradeoff" would be larger bytecode files. The use of
> >pointers to function types is not all that frequent so I wouldn't expect
> >it to save much.  In any event, we're not going to do anything with this
> >until there are solid numbers. I'm working on improving llvm-bcanalyzer
> >to provide them.
> 
> Right now I see pointer types being created for practically every literal 
> type defined anyway.  I doubt you'd see much file bloat due to pointer 
> types being implied for everything.  These pointer types are already being 
> defined.

The bloat doesn't occur because of the extra type definitions. A pointer
type definition is incredibly small (basically a VBR number or 1-2 bytes
to reference the element type). The bloat occurs because of the
automatic doubling of the slot numbers. Say you had 100 non-pointer
types of which 15 needed pointers for a total of 115. If we
automatically made even indices in the type table be non-pointer types
and odd indices be pointer-types, we'd have 200 entries in the table.
Now the entries above 127 all require 2 VBR bytes to reference instead
of just 1 as per the current mechanism. If those higher slot number
types are used frequently, this could add serious bloat to the file.
What we *really* want to do here is (a) only define types that are used
and (b) sort the types by frequency of use so that the most frequently
used types have the smallest slot numbers. This will ensure the most
compact encoding. 

You should try out llvm-bcanalyzer sometime. It will show you that on
average the size of the type table is about 1% of the file size wheras
the size of the instructions is around 90% (unless there's a lot of
symbols in which case the symbol table rivals the instruction lists).

> 
> However, I could see how it could conceivably save more file space to only 
> define pointer types where absolutely necessary, thus keeping the overall 
> number of types to an absolute minimum.  Chris mentioned this philosophy 
> and I think it's a good one.  Perhaps we could also find a way to declare 
> pointer types without having to declare the literal type if the literal 
> type is never used.  Functions are but one example of this.

True 'nuff.

Reid
-------------- next part --------------
A non-text attachment was scrubbed...
Name: signature.asc
Type: application/pgp-signature
Size: 189 bytes
Desc: This is a digitally signed message part
URL: <http://lists.llvm.org/pipermail/llvm-dev/attachments/20040826/6b11dfaf/attachment.sig>