[LLVMdev] LLVM IR is a compiler IR

Wed Oct 5 01:55:17 PDT 2011

Hi Dan,

I read five distinct requests in your well-written remarks, which may appeal to different people:

1. How can we make LLVM more portable? As Chris later pointed out, it's hard to achieve this goal on the input side while preserving C semantics, since even C source code doesn't really have that property. On the platform front, recent discussions about "non-standard" architectures highlighted that most of the LLVM effort really goes into x86 and ARM, and platforms that deviate from these reference points tend to be afterthoughts.

2. How can we make LLVM more stable over time? As a regular user of LLVM, I initially found the frequent changes in LLVM painful. On the other hand, the effort of keeping up is not a high price to pay if it keeps the code base fluid. It wouldn't hurt to take an approach like OpenGL's, where new stuff is tested through a shared "extensions" mechanism, and deprecation of old interfaces spans years. "It no longer works" is a message we see a little too often on LLVM-dev.

3. How can we clarify the specification of LLVM? In the good old Unix tradition, the source code is the documentation, and the "documentation" explains the bugs and gives simplistic examples. But standard-level specification is really hard and tends to spend an inordinate amount of time on corner cases ordinary folks don't care about. To wit: the C++ and C++ ABI standardization efforts. LLVM has the luxury of being able to just assert in the corner cases, and deal with them on demand.

4. How can we address minority needs in LLVM? Being a minority here, I can only second that. I'd say that LLVM has to keep its priorities straight. As someone else pointed out, one reason to pick up LLVM is that it gives me interoperability with C. I'm not willing to give that up, and that means I have to learn a little bit of the non-portable C way of doing things. That being said, minorities are also the guys keeping you on your toes.

5. How can we avoid selfish kludges and self-imposed limitations in the LLVM source code base? This is probably the most immediately actionable point. IMO, things tend to go in the right direction, at least in my experience. But it's always easy to lapse.

Overall, I don't really see these as technical or architectural issues. Rather, I'd say that LLVM is very "market driven", i.e. the largest communities (C and x86) tend to grab all the attention. Still, it has reached a level of maturity where even smaller teams like ours can benefit from the crumbs.

That being said, can we build a portable LLVM IR on top of the existing stuff without giving up C compatibility? I'm not sure. I would settle for a few sub-goals that may be more easily achieved, e.g. defining a subset of the IR that is exactly as portable as C, or ensuring that object layout settings default to the target, but can effectively be overridden in a meaningful way (think: C++ ABI inheritance rules, HP28/HP48 internal object layout, ...).
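
To make the layout sub-goal a bit more concrete, here is a minimal sketch, in Python for brevity. Everything in it is hypothetical: the `ILP32` and `LP64` tables hold illustrative sizes and alignments that merely resemble common 32-bit and 64-bit C ABIs, and `struct_layout` with its `overrides` and `packed` parameters is made up for this example. None of this is an LLVM API. It only shows how the same list of fields gets different offsets per target, and what "default to the target, but overridable" could look like.

```python
# Hypothetical per-target tables: type name -> (size, alignment) in bytes.
# The values are illustrative assumptions, not taken from any LLVM backend.
ILP32 = {"char": (1, 1), "int": (4, 4), "pointer": (4, 4), "double": (8, 4)}
LP64 = {"char": (1, 1), "int": (4, 4), "pointer": (8, 8), "double": (8, 8)}


def align_up(offset, alignment):
    """Round offset up to the next multiple of alignment."""
    return (offset + alignment - 1) // alignment * alignment


def struct_layout(fields, target, overrides=None, packed=False):
    """Lay out fields in declaration order, C-style.

    Sizes and alignments default to the target table; entries in
    `overrides` replace the target's values for the named types.
    Returns (list of field offsets, total struct size).
    """
    table = dict(target)
    if overrides:
        table.update(overrides)

    offset = 0
    struct_align = 1
    offsets = []
    for type_name in fields:
        size, alignment = table[type_name]
        if packed:
            alignment = 1  # packed structs ignore natural alignment
        offset = align_up(offset, alignment)
        offsets.append(offset)
        offset += size
        struct_align = max(struct_align, alignment)

    # Pad the tail so that arrays of the struct stay aligned.
    return offsets, align_up(offset, struct_align)


if __name__ == "__main__":
    fields = ["char", "pointer", "int", "double"]
    print("ILP32:", struct_layout(fields, ILP32))
    print("LP64: ", struct_layout(fields, LP64))
    print("LP64 with 4-byte pointers:",
          struct_layout(fields, LP64, overrides={"pointer": (4, 4)}))
    print("LP64 packed:", struct_layout(fields, LP64, packed=True))
```

With these assumed tables, the same four fields occupy 20 bytes under `ILP32` and 32 bytes under `LP64`, which is the kind of difference that gets baked into the IR once a front-end has lowered field accesses to fixed offsets.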

My two bytes
Christophe

On 4 oct. 2011, at 20:53, Dan Gohman wrote:

> In this email, I argue that LLVM IR is a poor system for building a
> Platform, by which I mean any system where LLVM IR would be a
> format in which programs are stored or transmitted for subsequent
> use on multiple underlying architectures.
> 
> LLVM IR initially seems like it would work well here. I myself was
> once attracted to this idea. I was even motivated to put a bunch of
> my own personal time into making some of LLVM's optimization passes
> more robust in the absence of TargetData a while ago, even with no
> specific project in mind. There are several things still missing,
> but one could easily imagine that this is just a matter of people
> writing some more code.
> 
> However, there are several ways in which LLVM IR differs from actual
> platforms, both high-level VMs like Java or .NET and actual low-level
> ISAs like x86 or ARM.
> 
> First, the boundaries of what capabilities LLVM provides are nebulous.
> LLVM IR contains:
> 
> * Explicitly Target-specific features. These aren't secret;
>  x86_fp80's reason for being is pretty clear.
> 
> * Target-specific ABI code. In order to interoperate with native
>  C ABIs, LLVM requires front-ends to emit target-specific IR.
>  Pretty much everyone around here has run into this.
> 
> * Implicitly Target-specific features. The most obvious examples of
>  these are all the different Linkage kinds. These are all basically
>  just gateways to features in real linkers, and real linkers vary
>  quite a lot. LLVM has its own IR-level Linker, but it doesn't
>  do all the stuff that native linkers do.
> 
> * Target-specific limitations in seemingly portable features.
>  How big can the alignment be on an alloca? Or a GlobalVariable?
>  What's the widest supported integer type? LLVM's various backends
>  all have different answers to questions like these.
> 
> Even ignoring the fact that the quality of the backends in the
> LLVM source tree varies widely, the question of "What can LLVM IR do?"
> has numerous backend-specific facets. This can be problematic for
> producers as well as consumers.
> 
> Second, and more fundamentally, LLVM IR is a fundamentally
> vague language. It has:
> 
> * Undefined Behavior. LLVM is, at its heart, a C compiler, and
>  Undefined Behavior is one of its cornerstones.
> 
>  High-level VMs typically raise predictable exceptions when they
>  encounter program errors. Physical machines typically document
>  their behavior very extensively. LLVM is fundamentally different
>  from both: it presents a bunch of rules to follow and then offers
>  no description of what happens if you break them.
> 
>  LLVM's optimizers are built on the assumption that the rules
>  are never broken, so when rules do get broken, the code just
>  goes off the rails and runs into whatever happens to be in
>  the way. Sometimes it crashes loudly. Sometimes it silently
>  corrupts data and keeps running.
> 
>  There are some tools that can help locate violations of the
>  rules. Valgrind is a very useful tool. But they can't find
>  everything. There are even some kinds of undefined behavior that
>  I've never heard anyone even propose a method of detection for.
> 
> * Intentional vagueness. There is a strong preference for defining
>  LLVM IR semantics intuitively rather than formally. This is quite
>  practical; formalizing a language is a lot of work, it reduces
>  future flexibility, and it tends to draw attention to troublesome
>  edge cases which could otherwise be largely ignored.
> 
>  I've done work to try to formalize parts of LLVM IR, and the
>  results have been largely fruitless. I got bogged down in
>  edge cases that no one is interested in fixing.
> 
> * Floating-point arithmetic is not always consistent. Some backends
>  don't fully implement IEEE-754 arithmetic rules even without
>  -ffast-math and friends, to get better performance.
> 
> If you're familiar with "write once, debug everywhere" in Java,
> consider the situation in LLVM IR, which is fundamentally opposed
> to even trying to provide that level of consistency. And if you allow
> the optimizer to do subtarget-specific optimizations, you increase
> the chances that some bit of undefined behavior or vagueness will be
> exposed.
> 
> Third, LLVM is a low level system that doesn't represent high-level
> abstractions natively. It forces them to be chopped up into lots of
> small low-level instructions.
> 
> * It makes LLVM's Interpreter really slow. The amount of work
>  performed by each instruction is relatively small, so the interpreter
>  has to execute a relatively large number of instructions to do simple
>  tasks, such as virtual method calls. Languages built for interpretation
>  do more with fewer instructions, and have lower per-instruction
>  overhead.
> 
> * Similarly, it makes really-fast JITing hard. LLVM is fast compared
>  to some other static C compilers, but it's not fast compared to
>  real JIT compilers. Compiling one LLVM IR level instruction at a
>  time can be relatively simple, ignoring the weird stuff, but this
>  approach generates comically bad code. Fixing this requires
>  recognizing patterns in groups of instructions, and then emitting
>  code for the patterns. This works, but it's more involved.
> 
> * Lowering high-level language features into low-level code locks
>  in implementation details. This is less severe in native code,
>  because a compiled blob is limited to a single hardware platform
> as well. But a platform which advertises architecture independence
>  which still has all the ABI lock-in of HLL implementation details
>  presents a much more frightening backwards compatibility specter.
> 
> * Apple has some LLVM IR transformations for Objective-C, however
>  the transformations have to reverse-engineer the high-level semantics
>  out of the lowered code, which is awkward. Further, they're
>  reasoning about high-level semantics in a way that isn't guaranteed
>  to be safe by LLVM IR rules alone. It works for the kinds of code
>  clang generates for Objective C, but it wouldn't necessarily be
>  correct if run on code produced by other front-ends. LLVM IR
>  isn't capable of representing the necessary semantics for this
>  unless we start embedding Objective C into it.
> 
> 
> In conclusion, consider the task of writing an independent implementation
> of an LLVM IR Platform. The set of capabilities it provides depends on who
> you talk to. Semantic details are left to chance. There are features
> which require a bunch of complicated infrastructure to implement which
> are rarely used. And if you want light-weight execution, you'll
> probably need to translate it into something else better suited for it
> first. This all doesn't sound very appealing.
> 
> LLVM isn't actually a virtual machine. It's widely acknowledged that the
> name "LLVM" is a historical artifact which doesn't reliably connote what
> LLVM actually grew to be. LLVM IR is a compiler IR.
> 
> Dan
> 