[LLVMdev] Packages

Sun Nov 16 19:42:01 PST 2003

On Sun, 2003-11-16 at 17:13, Chris Lattner wrote:
> > The point here is that XPL needs to keep track of what a given variable
> > represents at the source level. If the compiler sees a map that is
> > initially small it might represent it in LLVM assembly as a vector of
> > pairs. Later on, it gets optimized into being a hash table. In order to
> > do that and keep track of things, I need to know that the vector of
> > pairs is >intended< to be a map, not simply a vector of pairs.
> 
> Absolutely.  No matter what source language you're interested in, you want
> to know about _source_ variables/types/etc, not about LLVM variables,
> types, etc.
Right.
> 
> > Another reason to do this is to speed up compilation time. XPL works
> > similarly to Java in that you define a module and "import" other modules
> > into it.  I do not want to recompile a module each time it is imported.
> 
> Makes sense. On the LLVM side of the fence, we are planning on making the
> JIT cache native translations, so you only need to pay the translation
> cost the first time a function is executed.  This also plays into the
> 'offline compilation' idea.
I had assumed as much but I think I'm talking about something different.
When I said "I do not want to recompile a module each time it is
imported", I meant recompile in order to get the _source_ language
descriptions only. I wouldn't recompile to get the byte codes to be
executed because (presumably) those are already available as you noted. 
For example, if module A imports module B, I want to be able to just
instantaneously load from B the definitions of types, constants, global
variables and functions, as specified in the _source_ language without
going back to the _source_ and recompiling it to regenerate the
information.  If we were in the C/C++ world, this would be more akin to
header file pre-compilation.  I want to load the _source_ AST for a
given compiler very quickly, without revisiting the source code itself. 
> 
> > Since finding LLVM, I'm wondering if it wouldn't be better to store all
> > the AST information in the bytecode file so that I don't have
> > compilation information in one place and the code for it in another.
> > To do this, I'd need support from LLVM to put "compile time information"
> > into a bytecode or assembly file. This information would never be used
> > at runtime and never "optimized out". It just sits in the bytecode file
> > taking up space until some compiler (or other tool) asks for it.
> 
> Makes sense.   The LLVM bytecode file is packetized to specifically
> support these kinds of applications.  The bytecode reader can skip over
> sections it doesn't understand.  The unimplemented part is figuring out a
> format to put this into the .ll file (probably just a hex dump or
> something), and having the compiler preserve it through optimization.

Sort of. What I'm thinking of is a section that it normally skips over
(or, even better, never reaches because it's at the end). However, the
contents of that section would be interpretable by LLVM if someone asked
for it. That is, the contents of the section contain constant type and
variable definitions that are _not_ part of the executable program but
are the _source_ description for the program. Those source descriptions
are specified using regular LLVM Type and variable definitions but they
don't factor into the program at all.  When a bytecode file is loaded,
anything defined in such a section is just skipped over. When a compiler
or debugger asks for that section explicitly (the only way it gets
accessed), LLVM would interpret the bytecodes and give back an instance
of SymbolTable that only references Value and Type objects. These are
the types and values that the compiler writer emitted to describe the
_source_ and their semantics are up to the source compiler writer.
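
To make the idea concrete, here is a minimal sketch of a file with a
skippable, load-on-request section. It is not LLVM's actual bytecode
format: the tag/length layout, the "XPL " tag, the key=value payload and
all function names are assumptions made up for illustration.

```python
import struct

# Hypothetical container layout (NOT LLVM's real bytecode format):
# each section is a 4-byte tag, a 4-byte little-endian length, then the payload.
def write_sections(sections):
    out = bytearray()
    for tag, payload in sections:
        assert len(tag) == 4
        out += tag + struct.pack("<I", len(payload)) + payload
    return bytes(out)

def iter_sections(data):
    """Yield (tag, payload_offset, length) without reading any payload bytes."""
    pos = 0
    while pos < len(data):
        tag = data[pos:pos + 4]
        (length,) = struct.unpack_from("<I", data, pos + 4)
        yield tag, pos + 8, length
        pos += 8 + length

def load_program(data):
    """Normal load: read only the code section and skip everything else."""
    for tag, off, length in iter_sections(data):
        if tag == b"CODE":
            return data[off:off + length]
    return None

def load_source_description(data, language):
    """Explicit request: decode the named section into a name -> value table."""
    for tag, off, length in iter_sections(data):
        if tag == language:
            text = data[off:off + length].decode("utf-8")
            return dict(line.split("=", 1) for line in text.splitlines())
    return None

image = write_sections([
    (b"CODE", b"\x01\x02\x03"),
    (b"XPL ", b"Types=Map,Vector\nModuleOptions=-O2"),
])
```

Here load_program(image) never decodes the "XPL " payload, while
load_source_description(image, b"XPL ") returns a small table that a
compiler could query by name.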
> 
> >      5. Compile time information is defined as a set of global variables
> >         just the same as for the runtime definitions. The full use of
> >         LLVM Types (especially derived types like structures and
> >         pointers) can be used to define the global variables.
> 
> If you just want to do this _today_ you already can.  We have an
> "appending" linkage type which can make this very simple.  Basically
> global arrays with appending linkage automatically merge together when
> bytecode files are linked (just like sections are merged in a traditional
> linker).  If you want to implement your extra information using globals,
> that is no problem, they will just always be loaded and processed.

No. These _source_ descriptions are not to be loaded and processed ever
except by explicit instruction from a compiler or debugger. For normal
program execution they are always ignored. Furthermore, they must NOT be
merged unless you just mean concatenated into one big "source
description" segment. I don't see much utility in that myself. If by
merged you mean that commonly named global symbols are reduced to a
single copy (like linkonce), then this defeats the point. What if a
compiler wanted to emit a variable named "ModuleOptions" in each
translation unit that describes the _source_ compiler options used to
compile the module? If those all get merged away, you lose the ability
to distinguish different "ModuleOptions" for different modules. This is
the reason for point #6.
> 
> >      6. There are never any naming conflicts between compile time
> >         information variables in different modules. Each compile time
> >         global variable is, effectively, scoped in its module. This
> >         allows compiler writers to use the same name for various pieces
> >         of data in every module emitted without clashing.
> 
> If you use the appending linkage mechanism, you _want_ them to have the
> same name. :)
No, you don't, for the reason described above.  Is there a way to retain
the unique identity of each of the variables when using appending
linkage?
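
A small sketch of the concern, using a toy model rather than the real
LLVM linker. The dictionaries and the link_appending / link_scoped
helpers are hypothetical; they only contrast "merge same-named arrays"
with "keep each module's names in its own scope".

```python
# Toy model, NOT the LLVM linker: each module maps global names to arrays.
module_a = {"ModuleOptions": ["-O2"]}
module_b = {"ModuleOptions": ["-g"]}

def link_appending(*modules):
    """Concatenate same-named arrays into one global (merge by name)."""
    merged = {}
    for module in modules:
        for name, items in module.items():
            merged.setdefault(name, []).extend(items)
    return merged

def link_scoped(**modules):
    """Keep each module's compile time globals in a per-module namespace."""
    return {mod_name: dict(globals_) for mod_name, globals_ in modules.items()}

merged = link_appending(module_a, module_b)
scoped = link_scoped(A=module_a, B=module_b)
```

In the merged result there is one "ModuleOptions" array and nothing says
which element came from which module. In the scoped result,
scoped["B"]["ModuleOptions"] still answers the question for module B.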
> 
> >      7. The exact same facility for dealing with module scoped types and
> >         variables is used to deal with the compile time information.
> >         When asked for it, the VMCore would produce a SymbolTable that
> >         references all the global types and variables in the compile
> >         time information.
> 
> If you use globals directly, you can just use the standard stuff.

Perhaps. I'm unsure of the details, but you'd need to somehow mark these
globals as "not part of the program, never execute, ignore on load,
fetch only if requested".

> 
> >      8. LLVM assembler and bytecode reader will assure the syntactic
> >         integrity of the compile time information as it would for any
> >         other bytecode. It checks types, pointer references, etc. and
> >         emits warnings (errors?) if the compiler information is not
> >         syntactically valid.
> 
> How does it do this if it doesn't understand it?  I thought it would just
> pass it through unmodified?

Read my statement carefully. I said "syntactic integrity" not semantics.
LLVM would ensure that, within the compile time information (i.e. source
description) there are (a) no references to undefined types, (b) no
pointers to undefined symbols, (c) etc. These are all syntactic
constructs that can be checked by LLVM without ever really understanding
what the information in the compile time information actually _means_.
That interpretation is left to the compiler writer.  This just gives the
compiler writer some assurance that the content of the compile time
information at least makes some structural sense. Furthermore, this
information, even though it may represent a very complex data structure,
is treated as a big constant. There can be no variable parts (despite me
referencing this as "global variables" previously). There might, however,
be relocatable parts such as a reference to an actual function or global
variable.
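
As an illustration of a purely structural check, here is a minimal
sketch. The description format (a type name mapped to the type names it
refers to), the BUILTIN set and the function name are all assumptions;
the point is only that dangling references can be found without knowing
what any of the types mean.

```python
# Hypothetical description format: type name -> list of type names it refers to.
BUILTIN = {"int", "sbyte"}

def undefined_references(types):
    """Return sorted (type, missing_reference) pairs; an empty list means
    every reference resolves to a builtin or to a type defined in `types`."""
    defined = BUILTIN | set(types)
    return sorted(
        (name, ref)
        for name, refs in types.items()
        for ref in refs
        if ref not in defined
    )

good = {"Pair": ["int", "int"], "Map": ["Pair"]}
bad = {"Map": ["Pair"]}  # "Pair" is referenced but never defined
```

The checker accepts good and reports ("Map", "Pair") for bad, yet it has
no idea that "Map" is meant to be a map.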

> 
> >      9. LLVM makes no assertions about the semantics or content of the
> >         compile time information. It can be anything the compiler writer
> >         wishes to express to retain compilation information. Correctness
> >         of the information content (beyond syntactics) is left to the
> >         compiler writer.  Exceptions to this rule may be warranted where
> 
> This seems to contradict #8.
Not really. You don't want LLVM to specify to _source_ language compiler
writers what is and isn't valid semantically. In fact, you'd have a
really hard time doing so. You'd end up with (conceptually) something
like the GCC "tree" mess, trying to be all things to everyone. Why
bother? Leave that to the compiler writer. You only want LLVM to check
syntax/structure/referential integrity, etc. 
> 
> >         there is general applicability to multiple source languages.
> >         Debug (file & line number) info would seem to be a natural
> >         exception.
> 
> Note that debug information doesn't work with this model.  In particular,
> when the LLVM optimizer transmogrifies the code, it has to update the
> debug information to remain accurate.  This requires understanding (at
> some level) the debug format.

You're right. Debug information needs to be more closely aligned with
the actual code in order for it to survive transformation. In fact, this
raises some suspicions about the viability of my approach in general. If
the source description information contains references to a function
that gets eliminated because it's never called, what happens? Same thing
for types and variables at both global and function scope.
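
The problem can be shown with a toy model. Nothing here is an LLVM API:
eliminate_unused stands in for an optimizer pass that drops uncalled
functions, and dangling lists the source description entries that are
left pointing at nothing.

```python
def eliminate_unused(functions, called):
    """Stand-in for an optimizer pass: keep only functions that are called."""
    return {name: body for name, body in functions.items() if name in called}

def dangling(description, functions):
    """Source description entries whose target function no longer exists."""
    return sorted(ref for ref in description.values() if ref not in functions)

functions = {"main": "ret void", "helper": "ret void"}
description = {"MainDecl": "main", "HelperDecl": "helper"}
after = eliminate_unused(functions, called={"main"})
```

Before the pass, dangling(description, functions) is empty. After it,
the entry for "helper" refers to a function that is gone, so either the
optimizer has to update the description or the description has to
tolerate missing targets.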

>>> I'm off to do some serious thinking about this proposal :( <<<

> 
> >     10. Compile time information sections are marked with a name that
> >         relates to the high-level compiler that produced them. This
> >         avoids confusion when one language attempts to read the compile
> >         time information of another language.
> >
> > This is somewhat like an open ended, generalized ELF section for keeping
> > track of compiler and/or debug information.  Because it's based on
> > existing capabilities of LLVM, I don't think it would be particularly
> > difficult to implement either.
> 
> There are two ways to implement this, as described above:
>   1. Use global arrays of bytes or something.  If you want to, your arrays
>      can even have pointers to global variables and functions in them.
>   2. Use an untyped blob of data, attached to the .bc file.
> 
> #2 is better from the efficiency standpoint (it doesn't need to be loaded
> if not used), but #1 is already fully implemented (it is used to implement
> global ctor/dtors)...

I don't think #1 works because of the naming clash issue and because it
implies that these global arrays become part of the program. I
explicitly want to forbid that because (at least in the case of XPL), I
can imagine situations where the source description information is more
voluminous than the actual program by an order of magnitude (it's that
way with debug "symbol" information today).  

What I want to do is emit the same named global variable (your "arrays
of bytes or something") in each module to capture information about that
module. For example, I want to emit a global array of structures that
describes the types defined in the module. I want to call that global
array "Types". If I do that in every module, what happens? I get a link
time "duplicate symbol definition" error? If I use appending linkage, I
only get one of them? This is a disaster for this type of information.
And, the name must remain constant across modules so that I can say,
"load the compile time information for module X" and then "get variable
'Types' from that compile time information". I can then peruse the type
information for that module.  If I have to mangle the name in each
module, that's a little unfriendly and error prone. Furthermore, I do
NOT want this information to be part of the program. It isn't, it
describes the program.  

As such, your point #2 must be accommodated. The blob of data is
normally skipped when the program is executed. But, when it is
requested, that blob of data isn't just returned to the compiler as a
blob. Because it represents a constant graph of types and values, LLVM
first checks its integrity, then instantiates the necessary C++ objects
to represent it and places them into a symbol table which is returned to
the compiler. This means the compiler can quickly look up source
descriptions in that module.

If that approach is too cumbersome for LLVM, then I would vote for just
the "blob" thing and leave it to each compiler writer to interpret the
blob correctly.

Make sense?

> -Chris

Reid.