[LLVMdev] Garbage collection

Thu Feb 26 18:17:50 PST 2009

On Thursday 26 February 2009 17:25:56 Chris Lattner wrote:
> In my ideal world, this would be:
>
> 1. Subsystems [with clean interfaces] for thread management,
> finalization, object model interactions, etc.
> 2. Within different high-level designs (e.g. copying, mark/sweep, etc)
> there can be replaceable policy components etc.
> 3. A couple of actual GC implementations built on top of #1/2.
> Ideally there would only be a couple of high-level collectors that can
> be parameterized by replacing subsystems and policies.
> 4. A very simple language implementation that uses the facilities, on
> the order of complexity as the kaleidoscope tutorial.
>
> As far as I know, there is nothing that prevents this from happening
> today, we just need leadership in the area to drive it.  To avoid the
> "ivory tower" problem, I'd strongly recommend starting with a simple
> GC and language and get the whole thing working top to bottom. From 
> there, the various pieces can be generalized out etc.  This ensures
> that there is always a *problem being solved* and something that works
> and is testable.

I fear that the IR generator and GC are too tightly coupled.

For example, the IR I am generating shares pointers read from the heap, even
across function calls. That is built on the assumption that the pointers are
immutable and, therefore, that the GC is non-moving. The generated code is
extremely efficient, even though I have not yet enabled LLVM's optimizations,
precisely because of all this shared immutable data.

If you wanted to add a copying GC to my VM, you would probably replace every
lookup of an IR register with generated code that reloads the pointer. That
produces a lot of redundant loads that would greatly degrade performance, so
you would rely upon LLVM's optimization passes to clean them up again.
However, I bet they do not have enough information to recover all of the lost
performance. So there is a fundamental conflict here: a simple GC design
decision has a drastic effect on the IR generator.
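
To make the conflict concrete, here is a minimal sketch in Python rather than
IR. Everything in it (ToyHeap, the root slots, the two access functions) is
hypothetical and only models the idea; it is not how my VM or LLVM represents
anything. An "address" is a list index, and collect() stands in for a copying
collector that moves live objects and patches the root slots.

```python
class ToyHeap:
    """Toy copying heap. An 'address' is just an index into self.space."""

    def __init__(self):
        self.space = []   # current space: address -> payload
        self.roots = []   # root slots: slot index -> current address

    def alloc(self, payload):
        self.space.append(payload)
        return len(self.space) - 1

    def push_root(self, addr):
        self.roots.append(addr)
        return len(self.roots) - 1

    def collect(self):
        # Copy each rooted object into a fresh space and patch its root slot.
        # Payloads hold no pointers here, so tracing the roots is enough.
        new_space, forwarding = [], {}
        for slot, old in enumerate(self.roots):
            if old not in forwarding:
                new_space.append(self.space[old])
                forwarding[old] = len(new_space) - 1
            self.roots[slot] = forwarding[old]
        self.space = new_space


def cached_across_call(heap, slot, call):
    addr = heap.roots[slot]      # read once and keep in a "register"
    call()                       # may trigger a collection
    return heap.space[addr]      # stale address if the object moved


def reloaded_after_call(heap, slot, call):
    call()
    addr = heap.roots[slot]      # the extra load a moving GC forces
    return heap.space[addr]


def demo():
    heap = ToyHeap()
    a = heap.alloc("a")
    heap.alloc("garbage")        # unreachable, dropped by collect()
    b = heap.alloc("b")
    heap.push_root(b)            # slot 0
    slot_a = heap.push_root(a)   # slot 1
    stale = cached_across_call(heap, slot_a, heap.collect)
    fresh = reloaded_after_call(heap, slot_a, heap.collect)
    return stale, fresh
```

In this toy, demo() returns ("b", "a"): the cached address silently reads the
wrong object once the collector has moved things, while the reloading version
stays correct at the cost of one extra load after every call.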

Although it is theoretically possible to parameterize the IR generator
sufficiently to account for all possible combinations of GC designs, I suspect
the result would be a mess. Consequently, perhaps it would be better to
consider IR generation and the GC as a single entity and, instead, factor
them both out using a common high-level representation, not dissimilar to JVM
or CLR bytecode in terms of functionality but much more closely related to
LLVM's IR?
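
As a rough sketch of what that could look like, assume a made-up high-level
representation with only two operations, ("use", var) and ("call", fn), and a
single lowering function that knows whether the collector moves objects. The
op names, the lower() function and the emitted pseudo-instructions are all
hypothetical; the only point is that the front end emits the same program
either way and the GC-specific reloads are decided in one place.

```python
def lower(ops, gc_moves_objects):
    """Lower high-level ops to a list of pseudo-instructions (plain strings)."""
    out = []
    in_register = set()   # variables whose pointer is currently cached
    for op in ops:
        if op[0] == "use":
            var = op[1]
            if var not in in_register:
                out.append(f"load {var} from root slot")
                in_register.add(var)
            out.append(f"use {var}")
        elif op[0] == "call":
            out.append(f"call {op[1]}")
            if gc_moves_objects:
                # A call is a potential collection point: with a moving
                # collector every cached pointer may now be stale.
                in_register.clear()
        else:
            raise ValueError(f"unknown op: {op!r}")
    return out


def count_loads(instructions):
    return sum(1 for i in instructions if i.startswith("load"))


# The same high-level program, lowered under both GC designs.
program = [("use", "x"), ("call", "f"), ("use", "x"), ("call", "g"), ("use", "x")]
non_moving = lower(program, gc_moves_objects=False)
moving = lower(program, gc_moves_objects=True)
```

Here count_loads(non_moving) is 1 and count_loads(moving) is 3: the pointer
is shared across both calls in the first case and reloaded after each call in
the second, without the front end having to know which collector is in use.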

-- 
Dr Jon Harrop, Flying Frog Consultancy Ltd.
http://www.ffconsultancy.com/?e