[LLVMdev] More Encoding Ideas
Robert Mykland
robert at ascenium.com
Thu Aug 26 12:45:25 PDT 2004
At 09:37 PM 8/23/2004, you wrote:
>On Mon, 2004-08-23 at 19:46, Robert Mykland wrote:
> > At 06:43 PM 8/20/2004, Chris Lattner wrote:
> > >I don't understand what you're getting at here. You can change char to
> > >default to unsigned right now with llvm-gcc -funsigned-char. I don't
> > >understand how that would change anything to be more useful though.
> >
> > Well, in the old days, char strings were handled just like any other kind
> > of array of primitive types.
>
>And, they still are :)
No. If you define an array of int, the various int values that initialize
the table are defined seperately and then their value indexes are used in
the definition of the int array, which appears under its created type in
the constants list. Character strings are handled differently from this.
> > In that world, when char defaulted to signed
> > char, most of the heavily used ASCII symbols took two bytes to
> > encode.
>
>Um. What? ASCII is a 7-bit encoding. It defines values 0-127 which, even
>with a sign bit is encoded into one byte. Recall that in the "old days"
>computers had a parity bit as the 8th-bit because the memory failure
>rates were so high (think vacuum tubes).
Actually, by "old days" I meant LLVM version 0.9. In LLVM 0.9 they were
most often encoded as two bytes because most of the most commonly used
ASCII symbols are above 0x30.
> > Thus, (and I'm guessing here), you guys decided to treat char
> > strings as a special case to save space in the bytecode file.
>
>Actually, LLVM doesn't really treat character strings specially EXCEPT
>in the bcwriter and bcreader. There is no notion in LLVM of a "string",
>just primitive types and arrays of them. It is up to the front end
>compiler to define what it means by a "string". In the bytecode
>libraries of LLVM, we chose to interpret "[n x ubyte]" and "[n x sbyte]"
>as "strings" for reading and writing efficiency. They are, however,
>still just arrays of one of the two primitive single-byte types.
Okay, but this discussion is about the physical protocol of the
bytecode. That's what I'm referring to.
> > If all pointer types are implied, not a problem to create them. However,
> > in larger files it may cost a little due to slightly larger type
> > numbers. I'm not sure about the tradeoff here, but I expect that implied
> > pointers would still save more just because of pointers to function types.
>
>Pointers are used heavily in almost all languages. I can almost
>guarantee that the "tradeoff" would be larger bytecode files. The use of
>pointers to function types is not all that frequent so I wouldn't expect
>it to save much. In any event, we're not going to do anything with this
>until there are solid numbers. I'm working on improving llvm-bcanalyzer
>to provide them.
Right now I see pointer types being created for practically every literal
type defined anyway. I doubt you'd see much file bloat due to pointer
types being implied for everything. These pointer types are already being
defined.
However, I could see how it could conceivably save more file space to only
define pointer types where absolutely necessary, thus keeping the overall
number of types to an absolute minimum. Chris mentioned this philosophy
and I think it's a good one. Perhaps we could also find a way to declare
pointer types without having to declare the literal type if the literal
type is never used. Functions are but one example of this.
Regards,
-- Robert.
Robert Mykland Voice: (831) 462-6725
Founder/CTO Ascenium Corporation
More information about the llvm-dev
mailing list