[LLVMdev] Named register variables GNU-style, deux

Sat Apr 19 15:41:29 PDT 2014

Hello all,

Recently on this list (as of last month), Renato Golin of Linaro
posted a thread entitled "Named register variables, GNU-style"[1].
This thread concerned the implementation of the GNU Register variables
feature for LLVM. I'd like to give some input on this, as a developer
of the Glasgow Haskell Compiler, as we are a user of this feature.
Furthermore, our use case is atypical - it is efficiency oriented, not
hardware oriented (e.g. I believe the Linux APIC x86 subsystem uses
them for hardware, as well as MIPS Linux as mentioned). Bear with me
on the details.

I'll say up front our use case alone shouldn't sway major decisions,
nor am I screaming for the feature - I can sleep at night. But I found
there was a surprising lack of highlighted use cases, and perhaps in
the future, if things change, these points can offer some insight.

The summary is this: we use this feature in our garbage collector to
steal a register that is solely dedicated to a thread-local storage
for our multicore runtime system. This thread local data structure is
possibly the most performance sensitive variable in the entire
multicore system, to the point where we have spent significant time
optimizing every read or write, load or spill that could affect it.

Furthermore, the GC is tied to the threading system in several ways
and is parallel itself - a loss in performance here directly equates
to a large overall performance loss for every parallel, multicore
program.

The lack of this feature is now causing us significant problems,
particularly on Mac OS X, as it now uses Clang by default.

You would think that considering this variable is (p)thread local, we
could just use a __thread variable, or pthread_{get,set}specific, to
manage it. But on OS X, both of these equate to an absolutely huge
performance loss, upwards of 25%. That is unacceptable, realistically
speaking, but we've had to deal with it.
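
For anyone who hasn't compared the two, here is a minimal sketch of
both approaches side by side. The struct and function names
(thread_state, set_state, get_state_tls, get_state_key) are made up
for illustration and are not our actual runtime code; it assumes a
compiler with the GNU __thread extension and POSIX threads.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>

/* Hypothetical per-thread runtime state; name and field are illustrative. */
struct thread_state {
  int id;
};

/* Option 1: compiler-supported TLS via the GNU __thread extension. */
static __thread struct thread_state *tls_state;

/* Option 2: POSIX thread-specific data, reached through library calls. */
static pthread_key_t state_key;
static pthread_once_t state_key_once = PTHREAD_ONCE_INIT;

/* Create the key exactly once, no matter which thread gets here first. */
static void make_state_key(void) {
  int rc = pthread_key_create(&state_key, NULL);
  assert(rc == 0);
  (void)rc;
}

/* Record the calling thread's state through both mechanisms. */
void set_state(struct thread_state *s) {
  int rc;
  pthread_once(&state_key_once, make_state_key);
  tls_state = s;
  rc = pthread_setspecific(state_key, s);
  assert(rc == 0);
  (void)rc;
}

/* Read back via __thread: how cheap this is depends on the platform ABI. */
struct thread_state *get_state_tls(void) {
  return tls_state;
}

/* Read back via the pthread key: always at least one library call. */
struct thread_state *get_state_key(void) {
  pthread_once(&state_key_once, make_state_key);
  return pthread_getspecific(state_key);
}
```

Both getters return the same pointer for the calling thread; the only
difference is what each access costs on a given platform.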

On Linux, the situation isn't so bad. The ABI allows a __thread
variable to just be stored at a direct offset to the %fs segment,
meaning that a read/write is still very fast. In fact, __thread is
preferable on i386 Linux: the pathetic number of registers means
stealing one is a loss, not a win.

The situation is not so good on x86_64 OS X. Generally we would steal
r13 on a 64-bit platform. But that's not allowed with Clang.
Furthermore, the __thread implementation on OS X is terrible compared
to Linux: while internally it uses a segment register (%gs on x86_64)
for a specific set of internal, predefined keys, and also for __thread
and pthread_{get,set}specific, a read or write to a __thread variable
does NOT translate to a direct read/write. It translates to an indirect
call through %rdi.

In other words, this code:

#include <stdio.h>
#include <stdlib.h>

__thread int foo;

int main(int ac, char* av[]) {
  if (ac < 2) foo = 10;
  else foo = atoi(av[1]);

  printf("foo = %d\n", foo);

  return 0;
}

Translates to this on x86_64 Linux with Clang:

(gdb) disassemble main
Dump of assembler code for function main:
   0x00000000004005b0 <+0>: push   %rax
   0x00000000004005b1 <+1>: mov    %rsi,%rax
   0x00000000004005b4 <+4>: cmp    $0x2,%edi
   0x00000000004005b7 <+7>: mov    $0xa,%esi
   0x00000000004005bc <+12>: jl     0x4005d1 <main+33>
   0x00000000004005be <+14>: mov    0x8(%rax),%rdi
   0x00000000004005c2 <+18>: xor    %esi,%esi
   0x00000000004005c4 <+20>: mov    $0xa,%edx
   0x00000000004005c9 <+25>: callq  0x4004b0 <strtol@plt>
   0x00000000004005ce <+30>: mov    %rax,%rsi
   0x00000000004005d1 <+33>: mov    %esi,%fs:0xfffffffffffffffc
   0x00000000004005d9 <+41>: mov    $0x400694,%edi
   0x00000000004005de <+46>: xor    %eax,%eax
   0x00000000004005e0 <+48>: callq  0x400480 <printf@plt>
   0x00000000004005e5 <+53>: xor    %eax,%eax
   0x00000000004005e7 <+55>: pop    %rdx
   0x00000000004005e8 <+56>: retq

It translates to this on x86_64 OS X with Clang:

(lldb) disassemble -m -n main
a.out`main
a.out[0x100000f20]:  pushq  %rbp
a.out[0x100000f21]:  movq   %rsp, %rbp
a.out[0x100000f24]:  pushq  %rbx
a.out[0x100000f25]:  pushq  %rax
a.out[0x100000f26]:  movl   $0xa, %ebx
a.out[0x100000f2b]:  cmpl   $0x2, %edi
a.out[0x100000f2e]:  jl     0x100000f3b               ; main + 27
a.out[0x100000f30]:  movq   0x8(%rsi), %rdi
a.out[0x100000f34]:  callq  0x100000f60               ; symbol stub for: atoi
a.out[0x100000f39]:  movl   %eax, %ebx
a.out[0x100000f3b]:  leaq   0xde(%rip), %rdi          ; foo
a.out[0x100000f42]:  callq  *(%rdi)
a.out[0x100000f44]:  movl   %ebx, (%rax)
a.out[0x100000f46]:  leaq   0x43(%rip), %rdi          ; "foo = %d\n"
a.out[0x100000f4d]:  xorl   %eax, %eax
a.out[0x100000f4f]:  movl   %ebx, %esi
a.out[0x100000f51]:  callq  0x100000f66               ; symbol stub for: printf
a.out[0x100000f56]:  xorl   %eax, %eax
a.out[0x100000f58]:  addq   $0x8, %rsp
a.out[0x100000f5c]:  popq   %rbx
a.out[0x100000f5d]:  popq   %rbp
a.out[0x100000f5e]:  ret

Note the indirect call through %rdi on OS X.

Again, the performance difference between these two snippets cannot be
overstated. And pthread_{get,set}specific do even worse because
they're not inlined at all (remember, we're talking a 25-30% loss for
all programs).

There are details here on a bug of ours[2], where I have tracked and
examined this issue for the past year or so. We are getting desperate
to fix this for OS X users - to the point of inlining XNU internals to
either use 'predefined keys' (e.g. OS X has special 'fast TLS' keys
for WebKit on some versions) or inline the 'fast path' of
pthread_{get,set}specific to do a direct read/write.

We've tried many combinations of compiler settings and tweaks to try
and minimize these effects in the past, but still, a register variable
is essentially superior to all other solutions we've found, especially
on x86_64.
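
For readers unfamiliar with the feature, here is a minimal sketch of a
GNU-style global register variable, assuming GCC on x86_64. The names
(thread_state, current_state, with_state, read_id) are hypothetical
and this is not our actual runtime code. Because Clang rejects the
declaration, the sketch falls back to a __thread variable everywhere
else.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical per-thread runtime state; name and field are illustrative. */
struct thread_state {
  int id;
};

#if defined(__GNUC__) && !defined(__clang__) && defined(__x86_64__)
/* GNU extension: dedicate r13 to this pointer for the whole translation
 * unit, so the compiler never allocates r13 for anything else here. */
register struct thread_state *current_state __asm__("r13");
#else
/* Fallback for compilers or targets without the extension. */
static __thread struct thread_state *current_state;
#endif

/* Run `body` with `s` installed as the current state. r13 is callee-saved
 * in the x86_64 System V ABI, so the previous value is restored before
 * returning to code that does not know the register was stolen. */
int with_state(struct thread_state *s, int (*body)(void)) {
  struct thread_state *saved;
  int result;
  assert(s != NULL);
  saved = current_state;
  current_state = s;
  result = body();
  current_state = saved;
  return result;
}

/* With the register variable, this read needs no memory access to find
 * the pointer itself. */
int read_id(void) {
  return current_state->id;
}
```

A caller would write something like with_state(&st, read_id). The
point is that every function in the translation unit can reach the
state without a load, a call, or an extra argument.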

Even passing the thread-local variable around directly as an argument
to every single function is slower - because the function bodies are
so large, a spill will inevitably occur somewhere, causing loads (or
other spills) to interfere with a read/write later. Even combined with
manually lowering/lifting reads/writes, it still results in minor
losses and doesn't guarantee the compiler won't optimistically undo
that. It's not as bad as 30% though, more like 5-7% last I checked. But
that's still significant, still slower, far uglier for us to
implement, and it penalizes Linux unfairly unless it gets even uglier.
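
To make the argument-passing alternative concrete, here is a small
sketch of the pattern. The bump allocator and all the names in it
(thread_state, bump_alloc, alloc_pair) are invented for illustration
and are far simpler than anything in a real runtime.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical per-thread allocation state; fields are illustrative. */
struct thread_state {
  char *heap_ptr;   /* next free byte */
  char *heap_limit; /* one past the end of this thread's block */
};

/* Every function takes the state as an explicit first argument instead
 * of reading it from a dedicated register or a TLS slot. */
void *bump_alloc(struct thread_state *st, size_t bytes) {
  void *p;
  assert(st != NULL);
  if ((size_t)(st->heap_limit - st->heap_ptr) < bytes)
    return NULL; /* block exhausted */
  p = st->heap_ptr;
  st->heap_ptr += bytes;
  return p;
}

/* Callers have to thread `st` through every call. In a large function
 * body the register allocator is free to spill it to the stack. */
int *alloc_pair(struct thread_state *st, int a, int b) {
  int *cell = bump_alloc(st, 2 * sizeof *cell);
  if (cell != NULL) {
    cell[0] = a;
    cell[1] = b;
  }
  return cell;
}
```

Nothing here pins st to a register: whether it stays in one across a
large function body is entirely up to the register allocator, which is
exactly the part we cannot control.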

So, that's the long and short of it. Now we get to LLVM's implementation.

First, obviously, this need rules out Renato's proposal that only
non-allocatable registers be available.[3] We absolutely
have to have GPRs available, and nothing else makes sense for our use
case.

Chandler was strongly against this sort of idea, and likely with good
reason (I don't know anything about parameterizing the LLVM register
set over the set of reserved registers from a user. I don't know
anything about the designs. Sounds like madness to me, too). I have no
input on logistics. But we do need it, otherwise this feature is
totally useless to us.

Also, in the last set of discussions, Joerg Sonnenberger proposed[4]
that these registers be reserved - possibly at the global
(translation unit) level or local (function body) level. We also
require this - temporarily spilling GPRs otherwise will almost
certainly result in the same sort of problem as using a function
argument - they will always collide in ways we cannot control or
predict. We *do* actually care about every single read, write, spill
and load.

Renato replied that the need for this is just a workaround for an
inefficient compiler - and he's right, it is. Otherwise, we wouldn't
do it. :) And based on our observations, I'm sorry to say I don't
think GCC or LLVM are going to magically eliminate that difference of
5-7% loss we saw *consistently* any time soon. It's a realistic
difference to eliminate with enough work - but those wins don't ever
come easy, I know, and our code base is large and complex. That's
going to be a lot of work (but I know you're all smart enough for it).

Again, to recap, GHC alone is probably not a compelling enough use
case by itself to support these two points on the design - which seem
somewhat radical on review of the original threads. Our needs are
atypical for sure. But I hope they serve as a useful input while you
consider the design space.

And also, I apologize in advance if this is considered beating a dead horse.

Thanks.

[1] http://lists.cs.uiuc.edu/pipermail/llvmdev/2014-March/071503.html
[2] https://ghc.haskell.org/trac/ghc/ticket/7602
[3] http://lists.cs.uiuc.edu/pipermail/llvmdev/2014-March/071561.html
[4] http://lists.cs.uiuc.edu/pipermail/llvmdev/2014-March/071620.html

-- 
Regards,
Austin - PGP: 4096R/0x91384671