[LLVMdev] Packages

Sun Nov 16 18:42:01 PST 2003

On Sun, 2003-11-16 at 13:01, Vipin Gokhale wrote:
> While on the subject of annotating bytecode with analysis info, could I 
> entice someone to also think about carrying other types of source-level 
> annotations through into bytecode ? This is particularly useful for 
> situations where one wants to use LLVM infrastructure for its 
> whole-program optimization capabilities, however wouldn't want to give 
> up on the ability to debug the final product binary. At the moment, my 
> understanding is that source code annotations like file names, line 
> numbers etc isn't carried through. When one gets around to linking the 
> whole program, you end up with a single .s file of native machine code 
> (which by now is a giant collection of bits picked up from a multitude 
> of source files) with no ability to do symbolic debugging on the 
> resulting binary...

I wholeheartedly second that motion.

My purposes are a little different, however. The language for which I'm
compiling (XPL) is fairly high level. For example, data structures such
as hash tables and red black trees are simply referenced as "maps" which
map one type to another. What exact data structure is used underneath is
up to the compiler and runtime optimizer, even allowing transformation
of the underlying type at runtime. For example, a map that initially
contains 3 elements would probably just be a vector of pairs because it's
pretty straightforward to linearly scan a small table and it is space
efficient. But, as the map grows in size, it might transform itself into
a sorted vector so binary search can be used and then into a hash table
to reduce the overhead of searching further and then again later on into
a full red-black tree. Of course, all of this depends on whether
insertions and deletions are more frequent than lookups, etc.
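
A minimal sketch of how such a representation choice might look, written
in C++ purely for illustration. The names (MapRep, MapProfile, chooseRep)
and the size thresholds are assumptions, not part of XPL or LLVM; the
only thing taken from the text above is the progression from a vector of
pairs to a sorted vector, a hash table and a red-black tree, and the fact
that the lookup/update mix influences the choice.

```cpp
#include <cstddef>

// Hypothetical representation tags; the names are illustrative only.
enum class MapRep { PairVector, SortedVector, HashTable, RedBlackTree };

// Hypothetical usage profile that a runtime might gather for one map.
struct MapProfile {
    std::size_t size;     // current number of elements
    std::size_t lookups;  // lookups observed so far
    std::size_t updates;  // insertions plus deletions observed so far
};

// Thresholds are assumptions chosen only to make the sketch concrete.
constexpr std::size_t kSmall = 8;
constexpr std::size_t kMedium = 64;
constexpr std::size_t kLarge = 4096;

inline MapRep chooseRep(const MapProfile& p) {
    // A linear scan over a handful of pairs is cheap and compact.
    if (p.size <= kSmall) return MapRep::PairVector;
    // A sorted vector allows binary search, but every insertion shifts
    // elements, so it is only chosen when lookups dominate.
    if (p.size <= kMedium && p.lookups >= p.updates) return MapRep::SortedVector;
    // Mid-sized or update-heavy maps fall through to a hash table.
    if (p.size <= kLarge) return MapRep::HashTable;
    // Very large maps end up as a red-black tree.
    return MapRep::RedBlackTree;
}
```

A real optimizer would presumably re-evaluate this decision as the
profile changes and migrate the contents when the answer differs.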

The point here is that XPL needs to keep track of what a given variable
represents at the source level. If the compiler sees a map that is
initially small it might represent it in LLVM assembly as a vector of
pairs. Later on, it gets optimized into being a hash table. In order to
do that and keep track of things, I need to know that the vector of
pairs is >intended< to be a map, not simply a vector of pairs.
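
One way to picture this is a side table that records the declared
source-level kind separately from the current lowering, so the lowering
can change without losing the intent. The sketch below is hypothetical:
SourceKind, SourceIntent and IntentTable are invented names, and it uses
plain strings where a real compiler would refer to actual types.

```cpp
#include <cassert>
#include <map>
#include <string>

// Hypothetical source-level kinds a front end might declare.
enum class SourceKind { Map, Vector, Set };

struct SourceIntent {
    SourceKind kind;        // what the programmer declared
    std::string loweredAs;  // how the compiler currently represents it
};

class IntentTable {
public:
    // Remember what a variable is meant to be and how it is lowered now.
    void record(const std::string& var, const SourceIntent& intent) {
        assert(!var.empty());
        table_[var] = intent;
    }

    // Returns nullptr when nothing was recorded for this variable.
    const SourceIntent* lookup(const std::string& var) const {
        auto it = table_.find(var);
        return it == table_.end() ? nullptr : &it->second;
    }

    // Change only the lowering; the declared kind is left untouched.
    bool relower(const std::string& var, const std::string& newRep) {
        auto it = table_.find(var);
        if (it == table_.end()) return false;
        it->second.loweredAs = newRep;
        return true;
    }

private:
    std::map<std::string, SourceIntent> table_;
};
```

With such a table, a variable recorded as a Map lowered to "vector of
pairs" can later be relowered to "hash table" and still be reported as a
Map.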

Another reason to do this is to speed up compilation time. XPL works
similarly to Java in that you define a module and "import" other modules
into it.  I do not want to recompile a module each time it is imported.
I'd rather just save the static portion of the syntax tree (i.e. the
declarations) somewhere and load it en masse when it's referenced in
another compilation.  Currently, I have a partially implemented solution
for this based on my persistent memory module (like an object database
for C++ that allows you to save graphs of objects onto disk via virtual
memory management tricks). When a module is referenced in an import
statement, its disk segment is located and mapped into memory in one
shot: no parsing, no linking together, just instantly available. For
large software projects with 1000s of modules, this is a HUGE
compilation time win.
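
To make the "mapped into memory in one shot" idea concrete, here is a
rough sketch using POSIX mmap. It is not the persistent memory module
described above: it assumes a POSIX system and a flat array of
fixed-size, pointer-free records (the hypothetical DeclRecord), which
sidesteps the hard part of persisting graphs of objects. All names are
invented for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstdio>
#include <fcntl.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

// Hypothetical fixed-size declaration record. It holds no pointers, so the
// bytes on disk are usable as-is once they are mapped.
struct DeclRecord {
    char name[32];
    std::uint32_t kind;
    std::uint32_t typeId;
};

// Write n records to a segment file. Returns true on success.
inline bool writeSegment(const char* path, const DeclRecord* recs, std::size_t n) {
    std::FILE* f = std::fopen(path, "wb");
    if (!f) return false;
    bool ok = std::fwrite(recs, sizeof(DeclRecord), n, f) == n;
    return std::fclose(f) == 0 && ok;
}

// Read-only view of a segment file, mapped with a single mmap call.
class MappedSegment {
public:
    explicit MappedSegment(const char* path) {
        int fd = ::open(path, O_RDONLY);
        if (fd < 0) return;
        struct stat st;
        if (::fstat(fd, &st) == 0 && st.st_size > 0) {
            std::size_t len = static_cast<std::size_t>(st.st_size);
            void* p = ::mmap(nullptr, len, PROT_READ, MAP_PRIVATE, fd, 0);
            if (p != MAP_FAILED) { base_ = p; bytes_ = len; }
        }
        ::close(fd);  // the mapping remains valid after the descriptor is closed
    }
    ~MappedSegment() { if (base_) ::munmap(base_, bytes_); }
    MappedSegment(const MappedSegment&) = delete;
    MappedSegment& operator=(const MappedSegment&) = delete;

    bool ok() const { return base_ != nullptr; }
    std::size_t count() const { return bytes_ / sizeof(DeclRecord); }
    const DeclRecord& operator[](std::size_t i) const {
        assert(i < count());
        return static_cast<const DeclRecord*>(base_)[i];
    }

private:
    void* base_ = nullptr;
    std::size_t bytes_ = 0;
};
```

Nothing is parsed or copied when the segment is opened; pages are
brought in by the operating system as the records are touched.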

Since finding LLVM, I'm wondering if it wouldn't be better to store all
the AST information in the bytecode file so that I don't have
compilation information in one place and the code for it in another.  To
do this, I'd need support from LLVM to put "compile time information"
into a bytecode or assembly file. This information would never be used
at runtime and never "optimized out". It just sits in the bytecode file
taking up space until some compiler (or other tool) asks for it.

I've given some thought to this and here's how I think it should go:

     1. Compile time information is placed in a separate section of the
        bytecode file (presumably at the end to reduce runtime I/O)
     2. Nothing in the compile time information is used at runtime. It
        is neither the subject of optimization nor execution.
     3. Compile time information sections are completely optional. A
        given language compiler need not utilize them and they have no
        bearing on correct execution of the program.
     4. Compile time information is loaded only explicitly (presumably
        by a compiler based on LLVM) but also possibly by an
        optimization pass that would like to understand the higher-order
        semantics better (this would require the pass to be language
        specific, presumably).
     5. Compile time information is defined as a set of global variables
        just the same as for the runtime definitions. The full use of
        LLVM Types (especially derived types like structures and
        pointers) can be used to define the global variables. 
     6. There are never any naming conflicts between compile time
        information variables in different modules. Each compile time
        global variable is, effectively, scoped in its module. This
        allows compiler writers to use the same name for various pieces
        of data in every module emitted without clashing.
     7. The exact same facility for dealing with module scoped types and
        variables is used to deal with the compile time information.
        When asked for it, the VMCore would produce a SymbolTable that
        references all the global types and variables in the compile
        time information.
     8. The LLVM assembler and bytecode reader will ensure the syntactic
        integrity of the compile time information as it would for any
        other bytecode. It checks types, pointer references, etc. and
        emits warnings (errors?) if the compiler information is not
        syntactically valid.
     9. LLVM makes no assertions about the semantics or content of the
        compile time information. It can be anything the compiler writer
        wishes to express to retain compilation information. Correctness
        of the information content (beyond syntax) is left to the
        compiler writer.  Exceptions to this rule may be warranted where
        there is general applicability to multiple source languages.
        Debug (file & line number) info would seem to be a natural
        exception.
    10. Compile time information sections are marked with a name that
        relates to the high-level compiler that produced them. This
        avoids confusion when one language attempts to read the compile
        time information of another language.
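
The shape of the proposal can be sketched as a small data model. This is
only an illustration in standalone C++: Module, InfoSection and
InfoGlobal are hypothetical stand-ins and do not correspond to any
existing LLVM class or API, and types are modelled as plain strings
rather than LLVM Types.

```cpp
#include <cassert>
#include <map>
#include <string>
#include <utility>

// Hypothetical stand-in for a typed global variable (point 5).
struct InfoGlobal {
    std::string type;   // textual stand-in for the variable's type
    std::string value;  // opaque to everyone but the producing compiler (point 9)
};

// One compile time information section, tagged with its producer (point 10).
struct InfoSection {
    std::string producer;                       // e.g. "XPL"
    std::map<std::string, InfoGlobal> symbols;  // names scoped to this module (points 6, 7)
};

class Module {
public:
    explicit Module(std::string name) : name_(std::move(name)) {}

    const std::string& name() const { return name_; }

    // Sections are optional; a module with none is still valid (point 3).
    InfoSection& addSection(const std::string& producer) {
        assert(!producer.empty());
        InfoSection& s = sections_[producer];
        s.producer = producer;
        return s;
    }

    // Sections are handed out only on explicit request (point 4). Asking for
    // a producer that left no section yields nullptr, so one language does
    // not read another language's data by accident (point 10).
    const InfoSection* findSection(const std::string& producer) const {
        auto it = sections_.find(producer);
        return it == sections_.end() ? nullptr : &it->second;
    }

private:
    std::string name_;
    std::map<std::string, InfoSection> sections_;  // never consulted at runtime (point 2)
};
```

Because each module owns its own sections, two modules can both define a
compile time variable called, say, "decls" without clashing.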

This is somewhat like an open ended, generalized ELF section for keeping
track of compiler and/or debug information.  Because it's based on
existing capabilities of LLVM, I don't think it would be particularly
difficult to implement either.

Reid.