[LLVMdev] More Encoding Ideas

Thu Aug 26 12:45:25 PDT 2004

At 09:37 PM 8/23/2004, you wrote:
>On Mon, 2004-08-23 at 19:46, Robert Mykland wrote:
> > At 06:43 PM 8/20/2004, Chris Lattner wrote:
> > >I don't understand what you're getting at here.  You can change char to
> > >default to unsigned right now with llvm-gcc -funsigned-char.  I don't
> > >understand how that would change anything to be more useful though.
> >
> > Well, in the old days, char strings were handled just like any other kind
> > of array of primitive types.
>
>And, they still are :)

No.  If you define an array of int, the int values that initialize the 
array are each defined separately, and their value indexes are then used 
in the definition of the int array, which appears under its created type 
in the constants list.  Character strings are handled differently from this.
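
A rough sketch of that difference, in Python rather than the real bytecode
writer.  The names ConstantPool, encode_int_array and encode_string are
hypothetical, and the layout is only an assumption based on the description
above: each int initializer gets its own constant slot and the array refers
to those slots, while a character string is written as its raw bytes.

```python
# Illustrative model only: this is not the real LLVM bytecode writer.
# ConstantPool, encode_int_array and encode_string are hypothetical names.


class ConstantPool:
    """Assigns each distinct constant a slot index, in first-seen order."""

    def __init__(self):
        self.slots = []      # slot index -> constant value
        self.index_of = {}   # constant value -> slot index

    def intern(self, value):
        """Return the slot index for value, creating a slot if needed."""
        if value not in self.index_of:
            self.index_of[value] = len(self.slots)
            self.slots.append(value)
        return self.index_of[value]


def encode_int_array(values, pool):
    """Assumed int-array layout: a list of slot indexes into the pool."""
    return [pool.intern(v) for v in values]


def encode_string(text):
    """Assumed string layout: the raw bytes, with no per-element slots."""
    return list(text.encode("ascii"))


pool = ConstantPool()
int_array = encode_int_array([10, 20, 10, 30], pool)
string = encode_string("abc")
```

Here the int array costs three separate constant definitions plus four
indexes, while the string costs only its three bytes.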

> > In that world, when char defaulted to signed
> > char, most of the heavily used ASCII symbols took two bytes to
> > encode.
>
>Um. What? ASCII is a 7-bit encoding. It defines values 0-127 which, even
>with a sign bit is encoded into one byte. Recall that in the "old days"
>computers had a parity bit as the 8th-bit because the memory failure
>rates were so high (think vacuum tubes).

Actually, by "old days" I meant LLVM version 0.9.  In LLVM 0.9, characters 
were usually encoded as two bytes, because most of the commonly used 
ASCII symbols are above 0x30.
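
To show how a sign bit can push ordinary characters into a second byte,
here is a small Python sketch.  It assumes a generic variable-length
encoding with 7 payload bits per byte and, for signed values, the sign kept
in the lowest payload bit.  That scheme and the function names are
assumptions for illustration, not a confirmed description of the 0.9
format; under these assumptions the cutoff falls at 0x40, not 0x30.

```python
# Assumed scheme, for illustration only: 7 payload bits per byte, with the
# high bit of each byte meaning "more bytes follow".  Signed values keep
# their sign in the lowest payload bit, so the magnitude is shifted left
# by one before it is encoded.


def vbr_length(value):
    """Number of bytes needed to encode a non-negative integer."""
    length = 1
    while value >= 0x80:
        value >>= 7
        length += 1
    return length


def unsigned_length(ch):
    """Encoded size of a character treated as an unsigned value."""
    return vbr_length(ord(ch))


def signed_length(ch):
    """Encoded size of a character treated as a non-negative signed value."""
    return vbr_length(ord(ch) << 1)  # sign bit is 0, magnitude shifted up


sizes = {ch: (unsigned_length(ch), signed_length(ch)) for ch in "0?@Aa"}
```

In this model every 7-bit ASCII character fits in one byte when unsigned,
but letters need two bytes when the sign bit takes up part of the payload.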

> > Thus, (and I'm guessing here), you guys decided to treat char
> > strings as a special case to save space in the bytecode file.
>
>Actually, LLVM doesn't really treat character strings specially EXCEPT
>in the bcwriter and bcreader. There is no notion in LLVM of a "string",
>just primitive types and arrays of them. It is up to the front end
>compiler to define what it means by a "string". In the bytecode
>libraries of LLVM, we chose to interpret "[n x ubyte]" and "[n x sbyte]"
>as "strings" for reading and writing efficiency. They are, however,
>still just arrays of one of the two primitive single-byte types.

Okay, but this discussion is about the physical protocol of the 
bytecode.  That's what I'm referring to.

> > If all pointer types are implied, not a problem to create them.  However,
> > in larger files it may cost a little due to slightly larger type
> > numbers.  I'm not sure about the tradeoff here, but I expect that implied
> > pointers would still save more just because of pointers to function types.
>
>Pointers are used heavily in almost all languages. I can almost
>guarantee that the "tradeoff" would be larger bytecode files. The use of
>pointers to function types is not all that frequent so I wouldn't expect
>it to save much.  In any event, we're not going to do anything with this
>until there are solid numbers. I'm working on improving llvm-bcanalyzer
>to provide them.

Right now I see pointer types being created for practically every literal 
type that is defined anyway.  Since these pointer types are already being 
defined, I doubt you'd see much file bloat from implying pointer types 
for everything.

However, I could see how it could conceivably save more file space to only 
define pointer types where absolutely necessary, thus keeping the overall 
number of types to an absolute minimum.  Chris mentioned this philosophy 
and I think it's a good one.  Perhaps we could also find a way to declare 
pointer types without having to declare the literal type if the literal 
type is never used.  Functions are but one example of this.

Regards,

-- Robert.

Robert Mykland               Voice: (831) 462-6725
Founder/CTO                   Ascenium Corporation