[LLVMdev] LLVM and managed languages

Fri Jul 1 11:05:44 PDT 2011

So I've been using LLVM for about 4 years now, and I've posted a lot on this
list about specific issues. What I would like to do is step back for a
moment and give my "big picture" assessment of LLVM overall, particularly
with respect to developing a "managed" language like Java / C# or my own
language, Tart.

Obviously, I feel that LLVM is the best choice out there, otherwise I would
of switched to using other alternatives long ago. At the same time, however,
I've observed that the great majority of experimental languages - Groovy,
Scala, and many others - are built on top of the Java virtual machine, which
seems to be the platform of choice for language experimenters. In a similar
vein, if you spend any time studying the academic research on garbage
collection algorithms, you'll see that the JVM also appears to be the
platform of choice here as well.

One question to ask is, what factors would weigh heavily in the decision to
choose the JVM over LLVM when implementing a managed language? On the
positive side, LLVM's mature and simple APIs make it a breeze to manipulate
types and functions, and it's modular architecture means I can use the parts
I want without being forced to use parts I don't need. And the ability to go
straight to machine code means that I don't need to ship a complex virtual
machine with my executables.

I would also like to list the areas where I *don't need help* from LLVM -
that is, these are things I can easily handle myself in the frontend:

   - Dynamic dispatch and virtual methods / multi-methods
   - Boxing and unboxing of value types
   - Reflection
   - Memory management

I mention these items for two reasons: First, because these are services
that *would* be provided by a JVM, and I want people to know that the lack
of these items in LLVM does not in any way count as a detriment. And
secondly, because I've occasionally heard people on this list ask for
features like these, and I think it would be a waste of time to spend effort
implementing something that the frontend can easily handle.

Next, I want to list the areas where I think LLVM is lacking:

The first issue is more of a community-related one, which is that IMHO LLVM
doesn't have a lot of forward momentum in the experimental languages space.
The vast majority of contributions to the LLVM code base fall into three
categories: Clang-related work, ports to new backends, and optimization
research. Parts of LLVM which don't fall into these categories - like
garbage collection for example - tend to languish.

Note that this is not a criticism, but merely an observation - if it were a
criticism I'd have to blame myself as much as anyone else. There are good
reasons why things are as they are - the few people like myself who are
working on non-C-like languages generally don't have the energy to both work
on their front ends and at the same time do any serious amount of work on
the LLVM infrastructure. (You all saw what happened with the union types
fiasco - I contributed all of the front-end parts for declaring and
serializing unions, in the hopes that someone more knowledgeable would pick
up the effort of implementing them on the code generation side - but that
never happened and the feature got removed after six months of no progress.)

I think that LLVM's best hope in this department is that one of these
experimental languages becomes popular enough to attract a serious
contributor base of it's own, and that some of that effort can bleed off
into improving LLVM. Some of that was actually starting to happen in the
case of unladen-swallow, before that project was shut down.

The rest of my issues are more specific:

*Garbage collection is still way too difficult*. The biggest problem is the
inability to track SSA values - it requires the frontend to generate very
inefficient and error-prone code, manually spilling SSA values to the stack
around nearly every function call. I've written proposals for improving the
situation, which I won't repeat here - but realistically there's no way that
I am ever going to have time learn to enough about LLVM's code generation
passes to implement something like this myself.

Another area which I'd like to see supported is stack walking - that is,
starting from the current call frame, iterate through all of the call frames
above it. Currently the only way to do this is via platform-specific
assembly language - I'd think it would be relatively simple to make an
intrinsic that does this. (Note that the current stack intrinsics are
unusable, because they are (a) unreliable, and (b) inefficient. Taking an
integer index of a stack frame as a parameter is not the right way to do
it.)

I have to say, if there's any single issue that could make me give up on
LLVM and switch over to using something like a JVM, it would be this -
because as far as I can tell, LLVM's garbage collection features are in
exactly the same state they were in when I started working with LLVM four
years ago.

*Generating debug info* is another area which is very hard to use. The new
DIBuilder class was a significant improvement, but the whole area is still
very complex, it's extremely easy to make mistakes and then spend days or
weeks trying to figure out what went wrong - there's no type safety or
validation that can prevent you from making such mistakes. Over the last 4
years I've spent many man-months of time wrestling with DWARF.

(One also wonders if DWARF is even the right answer. Especially compared to
the JVM - the Java debuggers are able to derive much of their information
from the reflection and class metadata that's already there in the class
files, so from that perspective DWARF adds a huge amount of overhead.)

*Platform ABI limitations* - Currently LLVM requires the frontend developer
to know quite a lot about the platform ABI - for example whether you are
allowed to pass a struct of a certain size as a value type rather than as a
reference. The thing is, experimental languages like mine generally don't
care so much about ABI compatibility - sure, we'd like to be able to call C
library functions once in a while (and we don't mind doing the extra work in
those cases), but most of the time we just want to pass a data type around
and expect it to work. Requiring the use of different techniques on
different platforms makes the situation considerably more complex.

*Light-weight coroutines* would be a "nice to have", as would better
*concurrency
primitives*. These are things I could do on my own, but it would be better,
I think, to have them in LLVM - because in my view of the world, anything
that requires lots of architecture-specific knowledge ideally belongs on the
LLVM side of the line.

There's been a lot of discussion about divide-by-zero errors and other
*non-declared
exceptions*. Having this available would be a great help.

*Named structure types* - as per Chris's proposal - would be a major
simplification, as it would allow declaration of pointer fields without
having to import the code that defines the structure of the thing that the
pointer is pointing to.

Anyway that's pretty much my list of big-ticket items. As you can see,
anyone who cares about these issues would have to think very hard before
choosing LLVM over the JVM as an implementation platform.

-- 
-- Talin
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.llvm.org/pipermail/llvm-dev/attachments/20110701/e6f6eba9/attachment.html>