[LLVMdev] LLVM IR is a compiler IR

Tue Oct 4 14:42:13 PDT 2011

Interestingly I wrote a bytecode language exactly like this for my master's thesis, based atop of LLVM. I abandoned the project after graduating, but it had it's promising moments.
________________________________________
From: llvmdev-bounces at cs.uiuc.edu [llvmdev-bounces at cs.uiuc.edu] On Behalf Of Talin [viridia at gmail.com]
Sent: 04 October 2011 21:23
To: Dan Gohman
Cc: llvmdev at cs.uiuc.edu Mailing List
Subject: Re: [LLVMdev] LLVM IR is a compiler IR

Thank you for writing this. First, I'd like to say that I am in 100% agreement with your points. I've been tempted many times to write something similar, although what you've written has been articulated much better than what I would have said.

When I try to explain to people what LLVM is I say "It's essentially the back-end of a compiler" - a job it does extremely well. I don't say "It's a virtual machine", because that is a job it doesn't do very well at all.

I'd like to add a couple of additional items to your list - first, LLVM IR isn't stable, and it isn't backwards compatible. Bitcode is not useful as an archival format, because a bitcode file cannot be loaded if it's even a few months out of sync with the code that loads it. Loading a bitcode file that is years old is hopeless.

Also, bitcode is large compared to Java or CLR bitcodes. This isn't such a big deal, but for people who want to ship code over the network it could be an issue.

I've been thinking that it would be a worthwhile project to develop a high-level IR that avoids many of the issues that you raise. Similar in concept to Java byte code, but without Java's limitations - for example it would support pass-by-value types. (CLR has this, but it also has limitations). Of course, this IR would of necessity be less flexible than LLVM IR, but you could always dip into IR where needed, such as C programs dip into assembly on occasion.

This hypothetical IR language would include a type system that was rich enough to express all of the DWARF semantics - so that instead of having two parallel representations of every type (one for LLVM's code generators and one for DWARF), you could instead generate both the LLVM types and the DWARF DI's from a common representation. This would have a huge savings in both complexity and the size of bitcode files.

On Tue, Oct 4, 2011 at 11:53 AM, Dan Gohman <gohman at apple.com<mailto:gohman at apple.com>> wrote:
In this email, I argue that LLVM IR is a poor system for building a
Platform, by which I mean any system where LLVM IR would be a
format in which programs are stored or transmitted for subsequent
use on multiple underlying architectures.

LLVM IR initially seems like it would work well here. I myself was
once attracted to this idea. I was even motivated to put a bunch of
my own personal time into making some of LLVM's optimization passes
more robust in the absence of TargetData a while ago, even with no
specific project in mind. There are several things still missing,
but one could easily imagine that this is just a matter of people
writing some more code.

However, there are several ways in which LLVM IR differs from actual
platforms, both high-level VMs like Java or .NET and actual low-level
ISAs like x86 or ARM.

First, the boundaries of what capabilities LLVM provides are nebulous.
LLVM IR contains:

 * Explicitly Target-specific features. These aren't secret;
  x86_fp80's reason for being is pretty clear.

 * Target-specific ABI code. In order to interoperate with native
  C ABIs, LLVM requires front-ends to emit target-specific IR.
  Pretty much everyone around here has run into this.

 * Implicitly Target-specific features. The most obvious examples of
  these are all the different Linkage kinds. These are all basically
  just gateways to features in real linkers, and real linkers vary
  quite a lot. LLVM has its own IR-level Linker, but it doesn't
  do all the stuff that native linkers do.

 * Target-specific limitations in seemingly portable features.
  How big can the alignment be on an alloca? Or a GlobalVariable?
  What's the widest supported integer type? LLVM's various backends
  all have different answers to questions like these.

Even ignoring the fact that the quality of the backends in the
LLVM source tree varies widely, the question of "What can LLVM IR do?"
has numerous backend-specific facets. This can be problematic for
producers as well as consumers.

Second, and more fundamentally, LLVM IR is a fundamentally
vague language. It has:

 * Undefined Behavior. LLVM is, at its heart, a C compiler, and
  Undefined Behavior is one of its cornerstones.

  High-level VMs typically raise predictable exceptions when they
  encounter program errors. Physical machines typically document
  their behavior very extensively. LLVM is fundamentally different
  from both: it presents a bunch of rules to follow and then offers
  no description of what happens if you break them.

  LLVM's optimizers are built on the assumption that the rules
  are never broken, so when rules do get broken, the code just
  goes off the rails and runs into whatever happens to be in
  the way. Sometimes it crashes loudly. Sometimes it silently
  corrupts data and keeps running.

  There are some tools that can help locate violations of the
  rules. Valgrind is a very useful tool. But they can't find
  everything. There are even some kinds of undefined behavior that
  I've never heard anyone even propose a method of detection for.

 * Intentional vagueness. There is a strong preference for defining
  LLVM IR semantics intuitively rather than formally. This is quite
  practical; formalizing a language is a lot of work, it reduces
  future flexibility, and it tends to draw attention to troublesome
  edge cases which could otherwise be largely ignored.

  I've done work to try to formalize parts of LLVM IR, and the
  results have been largely fruitless. I got bogged down in
  edge cases that no one is interested in fixing.

 * Floating-point arithmetic is not always consistent. Some backends
  don't fully implement IEEE-754 arithmetic rules even without
  -ffast-math and friends, to get better performance.

If you're familiar with "write once, debug everywhere" in Java,
consider the situation in LLVM IR, which is fundamentally opposed
to even trying to provide that level of consistency. And if you allow
the optimizer to do subtarget-specific optimizations, you increase
the chances that some bit of undefined behavior or vagueness will be
exposed.

Third, LLVM is a low level system that doesn't represent high-level
abstractions natively. It forces them to be chopped up into lots of
small low-level instructions.

 * It makes LLVM's Interpreter really slow. The amount of work
  performed by each instruction is relatively small, so the interpreter
  has to execute a relatively large number of instructions to do simple
  tasks, such as virtual method calls. Languages built for interpretation
  do more with fewer instructions, and have lower per-instruction
  overhead.

 * Similarly, it makes really-fast JITing hard. LLVM is fast compared
  to some other static C compilers, but it's not fast compared to
  real JIT compilers. Compiling one LLVM IR level instruction at a
  time can be relatively simple, ignoring the weird stuff, but this
  approach generates comically bad code. Fixing this requires
  recognizing patterns in groups of instructions, and then emitting
  code for the patterns. This works, but it's more involved.

 * Lowering high-level language features into low-level code locks
  in implementation details. This is less severe in native code,
  because a compiled blob is limited to a single hardware platform
  as well. But a platform which advertizes architecture independence
  which still has all the ABI lock-in of HLL implementation details
  presents a much more frightening backwards compatibility specter.

 * Apple has some LLVM IR transformations for Objective-C, however
  the transformations have to reverse-engineer the high-level semantics
  out of the lowered code, which is awkward. Further, they're
  reasoning about high-level semantics in a way that isn't guaranteed
  to be safe by LLVM IR rules alone. It works for the kinds of code
  clang generates for Objective C, but it wouldn't necessarily be
  correct if run on code produced by other front-ends. LLVM IR
  isn't capable of representing the necessary semantics for this
  unless we start embedding Objective C into it.

In conclusion, consider the task of writing an independent implementation
of an LLVM IR Platform. The set of capabilities it provides depends on who
you talk to. Semantic details are left to chance. There are features
which require a bunch of complicated infrastructure to implement which
are rarely used. And if you want light-weight execution, you'll
probably need to translate it into something else better suited for it
first. This all doesn't sound very appealing.

LLVM isn't actually a virtual machine. It's widely acknoledged that the
name "LLVM" is a historical artifact which doesn't reliably connote what
LLVM actually grew to be. LLVM IR is a compiler IR.

Dan

_______________________________________________
LLVM Developers mailing list
LLVMdev at cs.uiuc.edu<mailto:LLVMdev at cs.uiuc.edu>         http://llvm.cs.uiuc.edu
http://lists.cs.uiuc.edu/mailman/listinfo/llvmdev

--
-- Talin

-- IMPORTANT NOTICE: The contents of this email and any attachments are confidential and may also be privileged. If you are not the intended recipient, please notify the sender immediately and do not disclose the contents to any other person, use it for any purpose, or store or copy the information in any medium.  Thank you.