[LLVMdev] Proposal: "load linked" and "store conditional" atomic instructions

Fri May 30 04:24:50 PDT 2014

On 29 May 2014, at 18:21, Tim Northover <t.p.northover at gmail.com> wrote:

>    void atomic_foo(int *addr) {
>      int oldval = *addr;
>      do {
>        newval = foo(oldval);
>      } while (__c11_compare_exchange_weak(addr, &oldval, newval));

This particular example is a bit difficult, because the best representation depends on the complexity of foo().  If foo() is simple (and guaranteed not to contain any atomics) then you'd want to do the load-linked at the start and store-conditional at the end, something like this:

     int oldval = load_linked(addr);
     do {
       newval = foo(oldval);
     } while (!store_conditional(addr, newval));

If, however, foo() is not visible, is complex, or has some other memory accesses[1] then the correct lowering would be a load at the start and then a load-linked/store conditional (but without a loop) at the end.

     int oldval = *addr;
     int oldval2;
     do {
       newval = foo(oldval);
       oldval2 = load_linked(addr);
     } while ((oldval != oldval2) || !store_conditional(addr, newval));

The lowering that we currently generate, however, is more like this:

     int oldval = *addr;
     int success = 1;
     do {
       newval = foo(oldval);
       int oldval2;
       do {
         oldval2 = load_linked(addr);
         if (oldval2 != oldval) {
           success = 0;
           break;
         }
       } while ((oldval != oldval2) || !store_conditional(addr, newval));
     } while (!success);

This is clearly suboptimal.  My problem with your current proposal is that we really need to preserve the weak cmpxchg semantics through optimisation to ensure correctness.  The front end does not know (except in trivial cases) which of the two earlier forms is correct.  It could always emit the second form for any weak cmpxchg, but that then makes optimisation harder because optimisers have to infer that it means a weak cmpxchg[2].

When we get to CodeGenPrepare, it's fairly easy to transform the original structure into the first correct output, if there are no memory accesses in the middle.  Transforming the second version into the first is harder unless the structure is preserved.  

In short, I believe that adding ll/sc to the IR will:

- Not make generating IR easier than adding weak cmpxchg
- Not make optimising IR easier than adding weak cmpxchg
- Not make code generation easier than adding weak cmpxchg
- Be more invasive than adding weak cmpxchg

David

[1] RISC-V, for example, requires that there be no memory accesses *at all* between ll and sc, some ARM implementations require that there be no memory accesses within the same cache line (see a bug report from a year or so ago on Apple hardware).

[2] There's a lot of work being done by Peter Sewell's group and others currently on what are semantically valid transforms for the C11 memory model, so as long as we preserve those semantics in the IR we have lots of opportunities to improve optimisations.