[LLVMdev] Packages

Mon Nov 17 23:16:00 PST 2003

On Sun, 16 Nov 2003, Reid Spencer wrote:

> header file pre-compilation.  I want to load the _source_ AST for a
> given compiler very quickly, without revisiting the source code itself.

Gotcha.

> Sort of. What I'm thinking of is a section that it normally skips over
> (or, even better, never reaches because it's at the end). However, the
> contents of that section would be interpretable by LLVM if someone asked
> for it. That is, the contents of the section contain constant type and
> variable definitions that are _not_ part of the executable program but
> are the _source_ description for the program. Those source descriptions
> are specified using regular LLVM Type and variable definitions but they
> don't factor into the program at all.  When a bytecode file is loaded,
> anything defined in such a section is just skipped over. When a compiler

Ok, this is all cool.

> or debugger asks for that section explicitly (the only way it gets
> accessed), LLVM would interpret the bytecodes and give back an instance
> of SymbolTable that only references Value and Type objects. These are
> the types and values that the compiler writer emitted to describe the
> _source_ and their semantics are up to the source compiler writer.

This isn't.  I don't understand exactly what you're talking about here.
What "Value" and "type" objects can there be if LLVM doesn't understand
it?  It seems to make more sense to me for the debugger or whatever to ask
for a named section, and get handed an _untyped block_ of binary data...

> > If you just want to do this _today_ you already can.  We have an
> > "appending" linkage type which can make this very simple.  Basically
> > global arrays with appending linkage automatically merge together when
> > bytecode files are linked (just like 'sections' are merged in a traditional
> > linker).  If you want to implement your extra information using globals,
> > that is no problem, they will just always be loaded and processed.
>
> No. These _source_ descriptions are not to be loaded and processed ever
> except by explicit instruction from a compiler or debugger. For normal

Okay...

> program execution they are always ignored. Furthermore, they must NOT be
> merged unless you just mean concatenated into one big "source
> description" segment. I don't see much utility in that myself. If by

That's what I meant.  Assuming LLVM doesn't understand the contents of it,
all it can do is concatenate.

> merged you mean that commonly named global symbols are reduced to a
> single copy (like linkonce), then this defeats the point. What if a

I did mean appended.

> compiler wanted to emit a variable named "ModuleOptions" in each
> translation unit that describes the _source_ compiler options  used to
> compile the module. If those all get merged away, you lose the ability
> to distinguish different "ModuleOptions" for different modules. This is
> the reason for point #6.

I understand.

> > >      6. There are never any naming conflicts between compile time
> > >         information variables in different modules. Each compile time
> > >         global variable is, effectively, scoped in its module. This
> > >         allows compiler writers to use the same name for various pieces
> > >         of data in every module emitted without clashing.
> >
> > If you use the appending linkage mechanism, you _want_ them to have the
> > same name. :)
> No, you don't for the reason described above.  Is there a way to retain
> the unique identity of each of the variables when using appending
> linkage?

In the example above, the idea is that you would specify a binary blob of
data put into an LLVM global constant array of bytes.  The LLVM linker
would concatenate these arrays of bytes without having any idea how to
interpret the bytes.  It would be up to your compiler to be able to
interpret the meaning of the bytes and to be able to determine the
'identity of the variables' given the raw data.
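
As a rough illustration of that idea (a Python sketch, not anything LLVM
provides): each translation unit could frame its blob as a self-describing
record, so that plain concatenation by the linker loses nothing. The record
layout, the function names, and the use of the module name as the key are all
assumptions made up for this example.

```python
import struct

# Hypothetical record layout: 2-byte name length, 4-byte payload length,
# then the name bytes and the payload bytes (little-endian header).
HEADER = struct.Struct("<HI")

def pack_record(module_name, payload):
    """Frame one module's blob so it can be found again after concatenation."""
    name = module_name.encode("utf-8")
    return HEADER.pack(len(name), len(payload)) + name + payload

def unpack_records(blob):
    """Walk the concatenated blob and recover each module's payload by name."""
    records = {}
    offset = 0
    while offset < len(blob):
        name_len, payload_len = HEADER.unpack_from(blob, offset)
        offset += HEADER.size
        name = blob[offset:offset + name_len].decode("utf-8")
        offset += name_len
        records[name] = blob[offset:offset + payload_len]
        offset += payload_len
    return records

# The linker's role in this sketch is plain concatenation of the byte arrays.
linked = pack_record("a.c", b"-O2 -g") + pack_record("b.c", b"-O0")
module_options = unpack_records(linked)
```

With this framing, the per-module "ModuleOptions" stay distinguishable even
though the linker only ever sees one array of bytes.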

> > >      7. The exact same facility for dealing with module scoped types and
> > >         variables are used to deal with the compile time information.
> > >         When asked for it, the VMCore would produce a SymbolTable that
> > >         references all the global types and variables in the compile
> > >         time information.
> >
> > If you use globals directly, you can just use the standard stuff.
>
> Perhaps, I'm unsure of the details but you'd need to somehow mark these
> globals as "not part of the program, never execute, ignore on load,
> fetch only if requested".

It would be straightforward to make the JIT materialize globals only when
they are referenced.
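
A toy model of what "materialize only when referenced" could look like, in
Python. This is a sketch of the general technique, not the LLVM JIT; the
class and method names are hypothetical.

```python
class LazyGlobals:
    """Toy model of a JIT that builds a global only on first reference."""

    def __init__(self, initializers):
        # name -> zero-argument callable that produces the global's contents
        self._initializers = dict(initializers)
        self._materialized = {}

    def get(self, name):
        # Materialize on first use, then reuse the cached result.
        if name not in self._materialized:
            self._materialized[name] = self._initializers[name]()
        return self._materialized[name]

    def materialized_names(self):
        return sorted(self._materialized)

jit = LazyGlobals({
    "counter": lambda: 0,
    "Types": lambda: b"large source description blob",
})
jit.get("counter")  # the running program only ever touches this global
```

Here "Types" costs nothing at run time unless a compiler or debugger asks
for it explicitly.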

> > >      8. LLVM assembler and bytecode reader will assure the syntactic
> > >         integrity of the compile time information as it would for any
> > >         other bytecode. It checks types, pointer references, etc. and
> > >         emits warnings (errors?) if the compiler information is not
> > >         syntactically valid.
> >
> > How does it do this if it doesn't understand it?  I thought it would just
> > pass it through unmodified?
>
> Read my statement carefully. I said "syntactic integrity" not semantics.
> LLVM would ensure that, within the compile time information (i.e. source
> description) there are (a) no references to undefined types, (b) no
> pointers to undefined symbols, (c) etc. These are all syntactic
> constructs that can be checked by LLVM without ever really understanding
> what the information in the compile time information actually _means_.
> That interpretation is left to the compiler writer.  This just gives the

So you mean it checks the LLVM types and LLVM variables?  I'm so confused,
I thought you were talking about source level stuff!  :)

> compiler writer some assurance that the content of the compile time
> information at least makes some structural sense. Furthermore, this
> information, even though it may represent a very complex data structure,
> is treated as a big constant. There can be no variable parts (despite me
> referencing this as "global variables" previously). There might, however
> be relocatable parts such as a reference to an actual function or global
> variable.

Ok, that is making more sense.  Yes, LLVM already supports this.

> > >      9. LLVM makes no assertions about the semantics or content of the
> > >         compile time information. It can be anything the compiler writer
> > >         wishes to express to retain compilation information. Correctness
> > >         of the information content (beyond syntactics) is left to the
> > >         compiler writer.  Exceptions to this rule may be warranted where
> >
> > This seems to contradict #8.

> Not really. You don't want LLVM to specify to _source_ language compiler
> writers what is and isn't valid semantically. In fact, you'd have a
> really hard time doing so. You'd end up with (conceptually) something
> like the GCC "tree" mess, trying to be all things to everyone. Why
> bother? Leave that to the compiler writer. You only want LLVM to check
> syntax/structure/referential integrity, etc.

Ok, I didn't understand what you meant by LLVM checking the structure but
not understanding the semantics.  You don't mean the structure _of the
data itself_, just that the LLVM view of it is ok.

> > >         there is general applicability to multiple source languages.
> > >         Debug (file & line number) info would seem to be a natural
> > >         exception.
> >
> > Note that debug information doesn't work with this model.  In particular,
> > when the LLVM optimizer transmogrifies the code, it has to update the
> > debug information to remain accurate.  This requires understanding (at
> > some level) the debug format.
>
> You're right. Debug information needs to be more closely aligned with
> the actual code in order for it to survive transformation. In fact, this
> raises some suspicions about the viability of my approach in general. If
> the source description information contains references to a function
> that gets eliminated because it's never called, what happens? Same thing
> for types and variables at both global and function scope.

If a global has a pointer to a function, that function will never be
eliminated.  Likewise, things like interprocedural constant propagation
(leading to deletion of arguments) will never happen.
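
The first point is ordinary reachability. A minimal Python sketch, assuming
the describing global is itself treated as a root (for example because it is
externally visible); the symbol names and the reference map are invented for
illustration.

```python
def live_symbols(references, roots):
    """Return every symbol reachable from the roots by following references."""
    live = set()
    worklist = list(roots)
    while worklist:
        symbol = worklist.pop()
        if symbol in live:
            continue
        live.add(symbol)
        worklist.extend(references.get(symbol, ()))
    return live

references = {
    "main": ["helper"],
    "SourceInfo": ["never_called"],  # a global holding a function pointer
    "helper": [],
    "never_called": [],
    "orphan": [],
}
live = live_symbols(references, roots=["main", "SourceInfo"])
```

"never_called" survives only because "SourceInfo" points at it, while
"orphan" has no referrer and can be dropped.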

> > There are two ways to implement this, as described above:
> >   1. Use global arrays of bytes or something.  If you want to, your arrays
> >      can even have pointers to global variables and functions in them.
> >   2. Use an untyped blob of data, attached to the .bc file.
> >
> > #2 is better from the efficiency standpoint (it doesn't need to be loaded
> > if not used), but #1 is already fully implemented (it is used to implement
> > global ctor/dtors)...
>
> I don't think #1 works because of the naming clash issue and because it
> implies that these global arrays become part of the program. I
> explicitly want to forbid that because (at least in the case of XPL), I
> can imagine situations where the source description information is more
> voluminous than the actual program by an order of magnitude (it's that
> way with debug "symbol" information today).

I understand exactly what you're saying.  Debug information in general has
this problem.  It's a very reasonable, and general, performance
optimization for the JIT to never materialize globals it doesn't need, so
this in and of itself isn't hard.  The hard part is that if you have
"external" pointers into the LLVM code, those pointers will be
invalidated very quickly by general transformations.  Presumably you don't
want to handcuff the optimizer too much.

> What I want to do is emit the same named global variable (your "arrays
> of bytes or something") in each module to capture information about that
> module. For example, I want to emit a global array of structures that
> describes the types defined in the module. I want to call that global
> array "Types". If I do that in every module, what happens? I get a link
> time "duplicate symbol definition" error?

Yes.

> If I use appending linkage, I only get one of them?

No.  The elements of the array will be concatenated together, as described
in:
http://llvm.cs.uiuc.edu/docs/LangRef.html#modulestructure
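
To make the two answers concrete, here is a toy linker in Python. It is only
a model of the behaviour described here, not the LLVM linker; the module
representation, the "external" linkage label, and the error class are
assumptions of the sketch.

```python
class DuplicateSymbolError(Exception):
    """Raised when two modules define the same non-appending global."""

def link(modules):
    """Merge modules; each maps name -> (linkage, list_of_elements)."""
    merged = {}
    for module in modules:
        for name, (linkage, elements) in module.items():
            if name not in merged:
                merged[name] = (linkage, list(elements))
                continue
            existing_linkage, existing = merged[name]
            if linkage == "appending" and existing_linkage == "appending":
                # Appending linkage: concatenate the array elements in order.
                existing.extend(elements)
            else:
                raise DuplicateSymbolError(name)
    return merged

a = {"Types": ("appending", ["struct.Point"])}
b = {"Types": ("appending", ["struct.Rect", "struct.Circle"])}
linked = link([a, b])
```

With "external" in place of "appending" in both modules, the same call
raises DuplicateSymbolError instead of concatenating.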

> This is a disaster for this type of information. And, the name must
> remain constant across modules so that I can say, "load the compile time
> information for module X" and then "get variable 'Types' from that
> compile time information". I can then peruse the type information for
> that module.  If I have to mangle the name in each module, that's a
> little unfriendly and error prone. Furthermore, I do NOT want this
> information to be part of the program. It isn't, it describes the
> program.

I understand.  This is exactly what appending linkage is for.

> If that approach is too cumbersome for LLVM, then I would vote for just
> the "blob" thing and leave it to each compiler writer to interpret the
> blob correctly.

This can certainly be done, but the problem is that random blobs on the
side will not be updated, and will be invalidated.

It seems to me that you're trying to address a problem semantically
equivalent to debug information, which I _want to directly address_, but
there are other more important things that need to be done first, as
prerequisites.  It is critically important to me to make the LLVM
transformations _implicitly_ update debug information as they do their
thing, without being aware of it.  Just like the symbol table is
implicitly always kept up-to-date.

Of course, doing this is not easy.  ;)
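
For the symbol table half of that analogy, the idea of implicit bookkeeping
can be sketched in a few lines of Python. These classes are hypothetical and
much simpler than LLVM's own; the point is only that the code renaming or
erasing a value never touches the table directly.

```python
class SymbolTable:
    """Maps names to values; updated only by the values themselves."""

    def __init__(self):
        self._by_name = {}

    def lookup(self, name):
        return self._by_name.get(name)

class Value:
    """A named value that keeps its owning symbol table in sync itself."""

    def __init__(self, name, table):
        self._table = table
        self._name = name
        table._by_name[name] = self

    def set_name(self, new_name):
        # A transformation just renames the value; the table follows along.
        del self._table._by_name[self._name]
        self._name = new_name
        self._table._by_name[new_name] = self

    def erase(self):
        # Deleting the value removes its table entry as a side effect.
        del self._table._by_name[self._name]

table = SymbolTable()
v = Value("tmp", table)
v.set_name("result")
```

Doing the same for debug information is harder because the side data
describes much more than a name.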

-Chris

-- 
http://llvm.cs.uiuc.edu/
http://www.nondot.org/~sabre/Projects/