[PATCH] Lower idempotent RMWs to fence+load

Robin Morisset morisset at google.com
Tue Sep 23 11:29:34 PDT 2014


Thanks for the review. Answers inline, and patch next.

================
Comment at: lib/CodeGen/AtomicExpandPass.cpp:96
@@ -91,3 +95,3 @@
     } else if (RMWI && TargetLowering->shouldExpandAtomicRMWInIR(RMWI)) {
       MadeChange |= expandAtomicRMW(RMWI);
     } else if (CASI && TargetLowering->hasLoadLinkedStoreConditional()) {
----------------
jfb wrote:
> I find that this code isn't quite obvious: if the RMW is idempotent then you try to simplify it in a target-defined way, and if that fails you call `expandAtomicRMW` from `simplifyIdempotentRMW`. There should be only one place where you call `expandAtomicRMW`, and it should be here.
> 
> Maybe add a `virtual shouldSimplifyIdempotentRMWInIR` function, or change the control flow here to first try to simplify; if that fails, check if you should expand, and if that also fails then keep going (I think this is better).
I've changed the control flow here; it is simpler (no more redundancy, only one call to expandAtomicRMW), but it lost a bit of readability. Please tell me if it looks OK to you.

================
Comment at: lib/CodeGen/AtomicExpandPass.cpp:481
@@ +480,3 @@
+      return C->isMinusOne();
+    default:
+      return false;
----------------
jfb wrote:
> You could handle `min`/`max`/`umin`/`umax` but those seem somewhat useless. I guess other optimizations won't do partial evaluation into an atomic instruction, so constant propagation will feed up to the value operand of the atomic but not further, so it may be possible that this optimization would fire?
There are two different optimizations doable for min/max/umin/umax:
- simplify them when the value operand is the constant INT_MIN, INT_MAX, ...
- simplify them when the value operand is necessarily the same as the value already in the memory cell (for some definition of "value", "already", and "memory cell"...).
It is not exactly clear to me which of the two you want. The first one is rather trivial to add: I would just need a way of getting INT_MIN/INT_MAX for any type size (probably not very hard, but I would rather keep it for a separate patch).

The second one would probably have to be a different pass, dealing with nightmarish complexity in the form of aliasing issues and subtle memory-model issues (just defining what "the current value in a memory location" means is basically impossible in C11). Generally, I don't want to open that can of worms unless it can be shown to be crucial in actual benchmarks: I lost months trying to prove the correctness of simple variants of that class of optimizations, with nothing to show for it but a slew of counter-examples and traps, so I am quite cautious about anything that looks like it.
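
For reference, the constant-based checks being discussed, including the hypothetical min/max extension, can be sketched outside LLVM roughly as follows. The `RMWOp` enum and `isIdempotentConstant` helper are illustrative names of mine, not the patch's actual code, and this hard-codes i32 where the real pass is width-generic:

```cpp
#include <cassert>
#include <cstdint>
#include <limits>

// Illustrative operation set, mirroring a subset of LLVM's AtomicRMWInst ops.
enum class RMWOp { Add, Sub, Or, Xor, And, Max, Min, UMax, UMin };

// Returns true when applying `Op` with constant `C` to any i32 value leaves
// that value unchanged, i.e. the RMW is idempotent and could become a load.
bool isIdempotentConstant(RMWOp Op, int32_t C) {
  switch (Op) {
  case RMWOp::Add: case RMWOp::Sub:
  case RMWOp::Or:  case RMWOp::Xor:
    return C == 0;                  // x op 0 == x
  case RMWOp::And:
    return C == -1;                 // x & ~0 == x
  // The first extension discussed above: min/max against the absorbing bound.
  case RMWOp::Max:
    return C == std::numeric_limits<int32_t>::min();
  case RMWOp::Min:
    return C == std::numeric_limits<int32_t>::max();
  case RMWOp::UMax:
    return C == 0;                  // unsigned max against 0
  case RMWOp::UMin:
    return static_cast<uint32_t>(C) == std::numeric_limits<uint32_t>::max();
  }
  return false;
}
```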

================
Comment at: lib/CodeGen/AtomicExpandPass.cpp:490
@@ +489,3 @@
+  auto ResultingLoad = TLI->lowerIdempotentRMWIntoFencedLoad(RMWI);
+  if (ResultingLoad) {
+    if (TLI->shouldExpandAtomicLoadInIR(ResultingLoad))
----------------
jfb wrote:
> Merge the two above lines, LLVM usually follows that style.
Done.

================
Comment at: lib/CodeGen/AtomicExpandPass.cpp:492
@@ +491,3 @@
+    if (TLI->shouldExpandAtomicLoadInIR(ResultingLoad))
+      expandAtomicLoad(ResultingLoad);
+    return true;
----------------
jfb wrote:
> Could you instead add the `LoadInst` to the `AtomicInsts` that you're iterating through? Same as before, I don't really like the repeated logic since it makes the code harder to modify and follow.
I tried doing that originally, but it is not obvious how to do it cleanly:
- adding the LoadInst to AtomicInsts might invalidate the iterator and break everything
- a goto from that case to the beginning of the function would do the trick, but "goto" is not really recommended in LLVM, I think.
- adding an extra level of nesting with a while loop would also work, but it would probably not be very readable (the nesting of the control flow in runOnFunction is already uncomfortable) and feels like overkill.
So I took this approach with a bit of redundancy because the other solutions looked even worse. I agree that it is ugly; I would love suggestions on how to clean it up.
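
For what it's worth, one generic way to sidestep the iterator-invalidation concern is to walk the worklist by index, re-reading `size()` each iteration, so appending during the walk is safe even if the vector reallocates. This is only a sketch of the pattern with made-up names, not a claim about what AtomicExpandPass should do:

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Process items, possibly enqueueing follow-up work. Indexing (rather than
// holding iterators) stays valid across push_back even on reallocation.
int processAll(std::vector<int> &Work) {
  int Processed = 0;
  for (std::size_t I = 0; I != Work.size(); ++I) { // size() re-read each time
    int Item = Work[I];
    ++Processed;
    if (Item > 0)
      Work.push_back(Item - 1); // safe: we never touch an iterator
  }
  return Processed;
}
```

Starting from `{2}` this processes 2, then the appended 1, then the appended 0, so three items in total.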

================
Comment at: lib/Target/X86/X86ISelLowering.cpp:17029
@@ +17028,3 @@
+  // there is no benefit in turning such RMWs into loads, and it is actually
+  // harmfull as it introduces a mfence.
+  if (MemType->getPrimitiveSizeInBits() > NativeWidth)
----------------
jfb wrote:
> s/harmfull/harmful/
Fixed.

================
Comment at: lib/Target/X86/X86ISelLowering.cpp:17030
@@ +17029,3 @@
+  // harmfull as it introduces a mfence.
+  if (MemType->getPrimitiveSizeInBits() > NativeWidth)
+    return nullptr;
----------------
jfb wrote:
> Not all primitive types smaller than native width have an appropriate load. Will this still work in these cases (or is it impossible to get here with types that aren't 8/16/32/64)?
Luckily, LLVM only allows power-of-two sizes (in bytes) for atomic RMW operations, so we don't have to bother checking for abominations like an i24 or i13.
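
The constraint being relied on here amounts to a check like the following; this is a standalone illustration of the rule (a whole, power-of-two number of bytes), with a name of my choosing, not LLVM's actual verifier code:

```cpp
#include <cassert>
#include <cstdint>

// True when BitWidth is a whole number of bytes and that byte count is a
// power of two (i.e. 8, 16, 32, 64, 128, ... bits) -- the only operand
// widths accepted for atomic RMW operations.
bool isValidAtomicRMWWidth(uint32_t BitWidth) {
  if (BitWidth == 0 || BitWidth % 8 != 0)
    return false;                    // rejects i13 and friends
  uint32_t Bytes = BitWidth / 8;
  return (Bytes & (Bytes - 1)) == 0; // power-of-two test; rejects i24
}
```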

================
Comment at: test/CodeGen/X86/atomic_idempotent.ll:6
@@ +5,3 @@
+; (such as atomic add 0) can be replaced by an mfence followed by a mov.
+; This is explained (with the motivation for such an optimisation) in
+; http://www.hpl.hp.com/techreports/2012/HPL-2012-68.pdf
----------------
jfb wrote:
> optimization
Fixed.
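
At the C++ level, the transformation being tested corresponds roughly to replacing an idempotent RMW with a fence followed by a plain load. The sketch below is mine, not part of the patch, and being single-threaded it only demonstrates that both forms read the same value (the memory-ordering argument is the subject of the HP report cited above):

```cpp
#include <atomic>
#include <cassert>

std::atomic<int> Counter{42};

// The idempotent RMW form: returns the old value, writes it back unchanged.
int readViaRMW() {
  return Counter.fetch_add(0, std::memory_order_seq_cst);
}

// The optimized form the patch targets on x86: mfence + mov.
int readViaFencedLoad() {
  std::atomic_thread_fence(std::memory_order_seq_cst); // lowers to mfence
  return Counter.load(std::memory_order_seq_cst);      // lowers to a plain mov
}
```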

================
Comment at: test/CodeGen/X86/atomic_idempotent.ll:47
@@ +46,2 @@
+  ret i128 %1
+}
----------------
jfb wrote:
> Could you also try weird integer sizes?
> 
> I also think that testing more RMW operations at least for 32-bit would be good, especially `and`.
I've added tests for and/sub.
However, atomic RMWs are not accepted by LLVM for sizes other than a power-of-two number of bytes (luckily for my sanity).

http://reviews.llvm.org/D5422
