[LLVMdev] Packages
Reid Spencer
reid at x10sys.com
Sun Nov 16 18:42:01 PST 2003
On Sun, 2003-11-16 at 13:01, Vipin Gokhale wrote:
> While on the subject of annotating bytecode with analysis info, could I
> entice someone to also think about carrying other types of source-level
> annotations through into bytecode ? This is particularly useful for
> situations where one wants to use LLVM infrastructure for its
> whole-program optimization capabilities, however wouldn't want to give
> up on the ability to debug the final product binary. At the moment, my
> understanding is that source code annotations like file names, line
> numbers, etc. aren't carried through. When one gets around to linking the
> whole program, you end up with a single .s file of native machine code
> (which by now is a giant collection of bits picked up from a multitude
> of source files) with no ability to do symbolic debugging on the
> resulting binary...
I wholeheartedly second that motion.
My purposes are a little different, however. The language for which I'm
compiling (XPL) is fairly high level. For example, data structures such
as hash tables and red black trees are simply referenced as "maps" which
map one type to another. What exact data structure is used underneath is
up to the compiler and runtime optimizer, even allowing transformation
of the underlying type at runtime. For example, a map that initially
contains 3 elements would probably just be a vector of pairs because it's
pretty straightforward to linearly scan a small table and it is space
efficient. But, as the map grows in size, it might transform itself into
a sorted vector so binary search can be used and then into a hash table
to reduce the overhead of searching further and then again later on into
a full red-black tree. Of course, all of this depends on whether
insertions and deletions are more frequent than look ups, etc.
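To make the idea concrete, here is a minimal sketch of a map that changes its own representation as it grows. It is written in Python rather than XPL, the class name and both size thresholds are made up for illustration, and it stops at the hash table stage (the red-black tree stage and the insert/lookup frequency heuristics are left out).

```python
import bisect


class AdaptiveMap:
    """A map whose backing representation changes as it grows (illustrative only)."""

    SMALL_LIMIT = 8     # hypothetical threshold: unsorted pairs -> sorted pairs
    SORTED_LIMIT = 64   # hypothetical threshold: sorted pairs -> hash table

    def __init__(self):
        self.kind = "pairs"   # one of "pairs", "sorted", "hash"
        self.keys = []        # parallel lists used by "pairs" and "sorted"
        self.values = []
        self.table = None     # dict used by "hash"

    def __len__(self):
        return len(self.table) if self.kind == "hash" else len(self.keys)

    def _find(self, key):
        """Return the index of key in the list-based forms, or -1 if absent."""
        if self.kind == "pairs":
            for i, k in enumerate(self.keys):    # linear scan of a small table
                if k == key:
                    return i
            return -1
        i = bisect.bisect_left(self.keys, key)   # binary search of a sorted table
        return i if i < len(self.keys) and self.keys[i] == key else -1

    def get(self, key, default=None):
        if self.kind == "hash":
            return self.table.get(key, default)
        i = self._find(key)
        return self.values[i] if i >= 0 else default

    def put(self, key, value):
        if self.kind == "hash":
            self.table[key] = value
            return
        i = self._find(key)
        if i >= 0:
            self.values[i] = value               # overwrite an existing key
        elif self.kind == "pairs":
            self.keys.append(key)
            self.values.append(value)
        else:
            pos = bisect.bisect_left(self.keys, key)
            self.keys.insert(pos, key)           # keep the sorted order intact
            self.values.insert(pos, value)
        self._maybe_transform()

    def _maybe_transform(self):
        """Switch to the next representation once a size threshold is crossed."""
        n = len(self.keys)
        if self.kind == "pairs" and n > self.SMALL_LIMIT:
            order = sorted(range(n), key=lambda i: self.keys[i])
            self.keys = [self.keys[i] for i in order]
            self.values = [self.values[i] for i in order]
            self.kind = "sorted"
        elif self.kind == "sorted" and n > self.SORTED_LIMIT:
            self.table = dict(zip(self.keys, self.values))
            self.keys, self.values = [], []
            self.kind = "hash"


# Usage: record each representation the map passes through while it grows.
m = AdaptiveMap()
kinds = []
for i in range(100):
    m.put(i, str(i))
    if not kinds or kinds[-1] != m.kind:
        kinds.append(m.kind)
```

The caller only ever sees `get` and `put`; which structure sits underneath is a private decision of the map, which is the property described above.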
The point here is that XPL needs to keep track of what a given variable
represents at the source level. If the compiler sees a map that is
initially small it might represent it in LLVM assembly as a vector of
pairs. Later on, it gets optimized into being a hash table. In order to
do that and keep track of things, I need to know that the vector of
pairs is >intended< to be a map, not simply a vector of pairs.
Another reason to do this is to speed up compilation time. XPL works
similarly to Java in that you define a module and "import" other modules
into it. I do not want to recompile a module each time it is imported.
I'd rather just save the static portion of the syntax tree (i.e. the
declarations) somewhere and load it en masse when it's referenced in
another compilation. Currently, I have a partially implemented solution
for this based on my persistent memory module (like an object database
for C++ that allows you to save graphs of objects onto disk via virtual
memory management tricks). When a module is referenced in an import
statement, its disk segment is located and mapped into memory in one
shot: no parsing, no linking together, just instantly available. For
large software projects with 1000s of modules, this is a HUGE
compilation time win.
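As a rough illustration of the "map it in and use it" approach, here is a small Python sketch using the standard `mmap` and `struct` modules. It is a simplification and not the persistent memory module itself: the record layout, the function names, and the `module.decl` file name are all assumptions invented for the example.

```python
import mmap
import os
import struct
import tempfile

# Hypothetical on-disk layout: a record count, then fixed-size records of
# (name: 16 bytes, NUL padded; kind: uint32; type_id: uint32).
HEADER = struct.Struct("<I")
RECORD = struct.Struct("<16sII")


def save_declarations(path, decls):
    """Write the declarations of one module as fixed-size binary records."""
    with open(path, "wb") as f:
        f.write(HEADER.pack(len(decls)))
        for name, kind, type_id in decls:
            f.write(RECORD.pack(name.encode("ascii"), kind, type_id))


def load_declaration(path, index):
    """Map the file and decode one record in place, without parsing the rest."""
    with open(path, "rb") as f, \
            mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as m:
        (count,) = HEADER.unpack_from(m, 0)
        if not 0 <= index < count:
            raise IndexError(index)
        offset = HEADER.size + index * RECORD.size
        raw, kind, type_id = RECORD.unpack_from(m, offset)
        return raw.rstrip(b"\0").decode("ascii"), kind, type_id


# Usage: write once at compile time, then look up a declaration on import.
path = os.path.join(tempfile.mkdtemp(), "module.decl")
save_declarations(path, [("Point", 1, 7), ("lookup", 2, 9)])
second = load_declaration(path, 1)
```

Because every record has a fixed size, an importer can jump straight to the record it needs; the operating system pages in only the parts of the file that are touched.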
Since finding LLVM, I'm wondering if it wouldn't be better to store all
the AST information in the bytecode file so that I don't have
compilation information in one place and the code for it in another. To
do this, I'd need support from LLVM to put "compile time information"
into a bytecode or assembly file. This information would never be used
at runtime and never "optimized out". It just sits in the bytecode file
taking up space until some compiler (or other tool) asks for it.
I've given some thought to this and here's how I think it should go:
1. Compile time information is placed in a separate section of the
bytecode file (presumably at the end to reduce runtime I/O)
2. Nothing in the compile time information is used at runtime. It
is neither the subject of optimization nor execution.
3. Compile time information sections are completely optional. A
given language compiler need not utilize them and they have no
bearing on correct execution of the program.
4. Compile time information is loaded only explicitly (presumably
by a compiler based on LLVM) but also possibly by an
optimization pass that would like to understand the higher-order
semantics better (this would require the pass to be language
specific, presumably).
5. Compile time information is defined as a set of global variables
just the same as for the runtime definitions. The full use of
LLVM Types (especially derived types like structures and
pointers) can be used to define the global variables.
6. There are never any naming conflicts between compile time
information variables in different modules. Each compile time
global variable is, effectively, scoped in its module. This
allows compiler writers to use the same name for various pieces
of data in every module emitted without clashing.
7. The exact same facility for dealing with module scoped types and
variables is used to deal with the compile time information.
When asked for it, the VMCore would produce a SymbolTable that
references all the global types and variables in the compile
time information.
8. The LLVM assembler and bytecode reader will ensure the syntactic
integrity of the compile time information as it would for any
other bytecode. It checks types, pointer references, etc. and
emits warnings (errors?) if the compiler information is not
syntactically valid.
9. LLVM makes no assertions about the semantics or content of the
compile time information. It can be anything the compiler writer
wishes to express to retain compilation information. Correctness
of the information content (beyond syntax) is left to the
compiler writer. Exceptions to this rule may be warranted where
there is general applicability to multiple source languages.
Debug (file & line number) info would seem to be a natural
exception.
10. Compile time information sections are marked with a name that
relates to the high-level compiler that produced them. This
avoids confusion when one language attempts to read the compile
time information of another language.
This is somewhat like an open-ended, generalized ELF section for keeping
track of compiler and/or debug information. Because it's based on
existing capabilities of LLVM, I don't think it would be particularly
difficult to implement either.
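A few of the points above (1, 2, 3, 4 and 10) can be sketched as a toy container format. This is Python, not LLVM code, and the magic number, the layout, and the function names are all invented for the example; it says nothing about how the real bytecode format is or should be laid out.

```python
import io
import struct

MAGIC = b"BCX0"  # hypothetical magic number, not LLVM's


def write_container(code, info_sections):
    """Layout: magic | code length | code | (name length | name | size | payload)*

    The compile time sections come last, so a runtime can stop reading early.
    """
    out = io.BytesIO()
    out.write(MAGIC)
    out.write(struct.pack("<I", len(code)))
    out.write(code)
    for name, payload in info_sections.items():
        encoded = name.encode("utf-8")
        out.write(struct.pack("<H", len(encoded)))
        out.write(encoded)
        out.write(struct.pack("<I", len(payload)))
        out.write(payload)
    return out.getvalue()


def _code_length(blob):
    if blob[:4] != MAGIC:
        raise ValueError("bad magic")
    (n,) = struct.unpack_from("<I", blob, 4)
    return n


def read_code(blob):
    """What a runtime does: read the code and ignore every trailing section."""
    n = _code_length(blob)
    return blob[8:8 + n]


def read_info(blob, wanted):
    """What a compiler does: skip the code, then find its own section by name."""
    pos = 8 + _code_length(blob)
    while pos < len(blob):
        (name_len,) = struct.unpack_from("<H", blob, pos)
        pos += 2
        name = blob[pos:pos + name_len].decode("utf-8")
        pos += name_len
        (size,) = struct.unpack_from("<I", blob, pos)
        pos += 4
        if name == wanted:
            return blob[pos:pos + size]
        pos += size                      # not ours: skip without interpreting it
    return None                          # sections are optional


# Usage: one module carrying sections written by two different producers.
blob = write_container(
    b"\x01\x02\x03",
    {"xpl": b"map<int,string> m", "debug": b"foo.xpl:12"},
)
```

Tagging each section with the name of the producing compiler is what lets a reader skip sections it does not understand, and a module with no sections at all still loads and runs the same way.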
Reid.