[LLVMdev] Garbage collection

Fri Feb 27 01:32:30 PST 2009

Jon Harrop wrote:
> On Thursday 26 February 2009 17:25:56 Chris Lattner wrote:
>> In my ideal world, this would be:
>>
>> 1. Subsystems [with clean interfaces] for thread management,
>> finalization, object model interactions, etc.
>> 2. Within different high-level designs (e.g. copying, mark/sweep, etc)
>> there can be replaceable policy components etc.
>> 3. A couple of actual GC implementations built on top of #1/2.
>> Ideally there would only be a couple of high-level collectors that can
>> be parameterized by replacing subsystems and policies.
>> 4. A very simple language implementation that uses the facilities, on
>> the order of complexity as the kaleidoscope tutorial.
>>
>> As far as I know, there is nothing that prevents this from happening
>> today, we just need leadership in the area to drive it.  To avoid the
>> "ivory tower" problem, I'd strongly recommend starting with a simple
>> GC and language and get the whole thing working top to bottom. From 
>> there, the various pieces can be generalized out etc.  This ensures
>> that there is always a *problem being solved* and something that works
>> and is testable.
> 
> I fear that the IR generator and GC are too tightly coupled.
> 
> For example, the IR I am generating shares pointers read from the heap even 
> across function calls. That is built on the assumption that the pointers are 
> immutable and, therefore, that the GC is non-moving. The generated code is 
> extremely efficient even though I have not even enabled LLVM's optimizations 
> yet precisely because of all this shared immutable data.
> 
> If you wanted to add a copying GC to my VM you would probably replace every 
> lookup of the IR register with a lookup of the code to reload it, generating 
> a lot of redundant loads that would greatly degrade performance so you would 
> rely upon LLVM's optimization passes to clean it up again. However, I bet 
> they do not have enough information to recover all of the lost performance. 
> So there is a fundamental conflict here where a simple GC design decision has 
> a drastic effect on the IR generator.
> 
> Although it is theoretically possible to parameterize the IR generator 
> sufficiently to account for all possible combinations of GC designs I suspect 
> the result would be a mess. Consequently, perhaps it would be better to 
> consider IR generation and the GC as a single entity and, instead, factor 
> them both out using a common high-level representation not dissimilar to JVM 
> or CLR bytecode in terms of functionality but much more closely related to 
> LLVM's IR?
> 

IMHO, it would be better if support for GC were dropped from llvm 
altogether. I say this having written a copying GC for my VM toolkit, 
which also uses llvm to do its JIT compilation. And it works just fine!

I have simply avoided the intrinsics.
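
A rough sketch of what a copying collector without the intrinsics might look like, assuming the generated code registers the address of every live reference on an explicit root stack. The toolkit's real design is not described in this message, so every name here (Obj, Heap, Root, example) is hypothetical and the heap is a toy fixed-size pair of semispaces.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical object layout: a payload, one reference field, and a
// forwarding pointer that is only set while a collection is running.
struct Obj {
    int value;
    Obj* next;
    Obj* forward;
};

// Hypothetical runtime: two semispaces plus an explicit stack of root slots.
struct Heap {
    std::vector<Obj> from, to;
    std::vector<Obj**> roots;  // addresses of locals that hold references

    explicit Heap(std::size_t n) { from.reserve(n); to.reserve(n); }

    Obj* alloc(int v, Obj* next) {
        // Never grow past the reserved capacity, so object addresses
        // stay stable between collections.
        assert(from.size() < from.capacity());
        from.push_back(Obj{v, next, nullptr});
        return &from.back();
    }

    // Copy one object into to-space, or return the copy made earlier.
    Obj* copy(Obj* o) {
        if (!o) return nullptr;
        if (o->forward) return o->forward;
        to.push_back(Obj{o->value, o->next, nullptr});
        o->forward = &to.back();
        return o->forward;
    }

    // Cheney-style pass: copy the roots, then scan to-space for more work.
    void collect() {
        to.clear();
        for (Obj** slot : roots) *slot = copy(*slot);
        for (std::size_t scan = 0; scan < to.size(); ++scan) {
            Obj* moved = copy(to[scan].next);
            to[scan].next = moved;
        }
        from.swap(to);  // the survivors are now the live semispace
        to.clear();
    }
};

// RAII guard standing in for the push/pop the code generator would emit.
struct Root {
    Heap& heap;
    Root(Heap& h, Obj*& slot) : heap(h) { heap.roots.push_back(&slot); }
    ~Root() { heap.roots.pop_back(); }
};

// Builds a two-element list plus one garbage object, collects, and
// reports whether the rooted local was updated to the moved copy.
bool example() {
    Heap h(8);
    Obj* list = h.alloc(2, nullptr);
    Root guard(h, list);
    list = h.alloc(1, list);
    h.alloc(99, nullptr);  // unreachable, so it is not copied
    Obj* stale = list;     // an unregistered copy of the pointer
    h.collect();
    return list != stale && list->value == 1 && list->next->value == 2;
}
```

The unregistered copy of the pointer in example() is the hazard raised in the quoted text: any reference kept outside a registered slot goes stale once objects move, so it has to be reloaded after every point where a collection can happen.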

The problem with llvm's GC support is that to write a GC using the 
intrinsics, you have to mess around with the code-gen part of llvm.

When I come to add a generational collector to my toolkit in the future, 
it will be easy to specify write barriers directly in the IR. Modifying 
code-gen to handle the intrinsics is a task I would rather avoid.
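
To illustrate why a write barrier is easy to emit as ordinary IR, here is a minimal card-marking sketch in C++. It assumes a card-table design, which is only one of several possible barrier schemes and is not necessarily the one intended above. CardTable, store_ref and the 512-byte card size are all hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

constexpr std::size_t kCardShift = 9;  // 512-byte cards: an assumed size

// Hypothetical card table: one dirty byte per card of heap memory.
struct CardTable {
    std::uintptr_t heap_base;
    std::vector<std::uint8_t> dirty;

    CardTable(const void* base, std::size_t heap_bytes)
        : heap_base(reinterpret_cast<std::uintptr_t>(base)),
          dirty((heap_bytes >> kCardShift) + 1, 0) {}

    // Index of the card covering a given heap address.
    std::size_t card_of(const void* addr) const {
        return (reinterpret_cast<std::uintptr_t>(addr) - heap_base)
               >> kCardShift;
    }
};

// What the IR generator would emit at each reference store: the store
// itself, then a subtract, a shift and a byte store to mark the card.
inline void store_ref(CardTable& ct, void** slot, void* value) {
    *slot = value;
    ct.dirty[ct.card_of(slot)] = 1;
}
```

At a minor collection the collector would scan only the dirty cards for references into the young generation. Nothing in the barrier needs more than plain loads, stores and arithmetic.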

Mark.