[LLVMdev] Plan to optimize atomics in LLVM

Robin Morisset morisset at google.com
Tue Aug 5 15:46:22 PDT 2014


Hello everyone,

I have recently started on optimizing C11/C++11 atomics in LLVM, and plan
to focus on that for the next two months as an intern in the PNaCl team.
I’ve sent two patches on this topic to Phabricator that fix
http://llvm.org/bugs/show_bug.cgi?id=17281:

http://reviews.llvm.org/D4796

http://reviews.llvm.org/D4797


The first patch is X86-specific, and tries to apply operations with
immediate operands directly to atomics, without going through a register.
The main trouble here is that the X86 backend appears to respect the LLVM
memory model rather than the x86-TSO memory model, and may reorder
instructions. In order to prevent illegal reordering of atomic accesses, the
backend converts atomic accesses into pseudo-instructions in
X86InstrCompiler.td (RELEASE_MOV* and ACQUIRE_MOV*) that are opaque to most
of the rest of the backend, and only lowers them at the very end of the
pipeline. I have decided to follow the same approach, just adding some more
RELEASE_* pseudo-instructions, rather than trying to find every possibly
misbehaving part of the backend in order to do early lowering. This lowers
the risk and complexity of the patch, but at the cost of possibly missing
some optimization opportunities.
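
To make the target concrete, here is a small example in the spirit of
PR17281; the source, orderings and assembly below are my own illustration of
the bug report, not code taken from the patch:

  #include <atomic>

  std::atomic<int> x{0};  // name invented for the example

  void bump() {
    // Naive lowering goes through a register:
    //   movl x(%rip), %eax
    //   addl $1, %eax
    //   movl %eax, x(%rip)
    // Desired lowering applies the immediate directly to memory:
    //   addl $1, x(%rip)
    // No lock prefix is needed: this is an atomic load followed by an
    // atomic store, not a single atomic read-modify-write.
    x.store(x.load(std::memory_order_relaxed) + 1,
            std::memory_order_relaxed);
  }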

Another trouble I had with this patch is a failure of TableGen type
inference when adding negate/not to the scheme. As a result, I have left
these two instructions commented out in this patch. Does anyone have an
idea of how to proceed with this?

The second patch is more straightforward: in the C11/C++11 memory model
(which LLVM essentially borrows), optimizations like DSE can safely fire
across atomic accesses in many cases, essentially whenever they do not
operate across a release-acquire pair (see the paper referenced in the
comments). So I tweaked MemoryDependenceAnalysis to track such pairs and
only return a clobber result when one is present.
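
To make the condition concrete, here is a sketch of my own (the globals and
function names are invented, and the comments state my understanding of the
model rather than anything in the patch):

  #include <atomic>

  std::atomic<int> flag{0};
  int data;

  void removable() {
    data = 1;                              // candidate dead store
    flag.load(std::memory_order_relaxed);  // atomic access in between
    data = 2;
    // DSE may delete "data = 1": another thread could only observe it by
    // racing on the non-atomic "data", which is undefined behaviour.
  }

  void not_removable() {
    data = 1;
    flag.store(1, std::memory_order_release);             // release ...
    while (flag.load(std::memory_order_acquire) != 2) {}  // ... acquire
    data = 2;
    // Another thread may acquire flag == 1, legitimately read data == 1,
    // and then set flag to 2; the first store is not dead, so the analysis
    // has to report a clobber here.
  }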

My next steps will probably be to improve the compilation of acquire
atomics in the ARM backend. In particular, they are currently compiled to a
load + dmb, while a load + dependent (but otherwise useless) branch + isb is
also valid (see http://www.cl.cam.ac.uk/~pes20/cpp/cpp0xmappings.html for
example) and may be faster. Even better: if there is already a dependent
branch (such as the loop produced when lowering CAS), all that is needed is
a cheap isb. The main step will be switching off the InsertFencesForAtomic
flag and doing the lowering of atomics in the backend, because once an
acquire load has been transformed into an acquire fence, too much
information has been lost to apply this mapping.
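
For reference, here is roughly what the two mappings look like for an
acquire load; the assembly in the comments paraphrases the mapping table
linked above, and the register names and labels are only illustrative:

  #include <atomic>

  std::atomic<int> a{0};  // name invented for the example

  int acquire_load() {
    // Current lowering (fence-based):
    //   ldr r0, [r1]
    //   dmb ish
    // Alternative lowering (dependent branch + isb):
    //   ldr r0, [r1]
    //   teq r0, r0        @ branch depends on the loaded value
    //   beq 1f
    // 1: isb
    // Both prevent later accesses from being reordered before the load;
    // the second can be nearly free when a dependent branch already
    // exists, as in the retry loop that a CAS expands to.
    return a.load(std::memory_order_acquire);
  }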

Longer term, I hope to improve fence elimination in the ARM backend with a
kind of PRE (partial redundancy elimination) algorithm. Both of these
improvements to the ARM backend should be fairly straightforward to port to
the POWER architecture later, and I hope to do that as well.
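
As a simple example of the kind of redundancy such a pass could remove (my
own example, using the fence-based seq_cst mapping of dmb; str; dmb per
store):

  #include <atomic>

  std::atomic<int> x{0}, y{0};  // names invented for the example

  void two_stores() {
    x.store(1, std::memory_order_seq_cst);  // dmb; str x; dmb
    y.store(1, std::memory_order_seq_cst);  // dmb; str y; dmb
    // Emitted today (roughly): dmb; str x; dmb; dmb; str y; dmb
    // One of the two back-to-back dmbs between the stores is redundant,
    // since no memory access separates them; a PRE-style pass generalizes
    // this to fences that are only redundant along some paths.
  }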

Does this approach seem worthwhile to you? Can I do anything to help the
review process?

Thank you,

Robin Morisset