[PATCH] transform fadd chains to increase parallelism

Thu Apr 23 13:50:05 PDT 2015

Hi hfinkel, qcolombet, andreadb,

This is a compromise: with this simple patch, we should always handle a chain of exactly 3 operations optimally, but we're not generating the optimal balanced binary tree for a longer sequence. 

In general, this transform will reduce the dependency chain for a sequence of instructions using N operands from a worst case N-1 dependent operations to N/2 dependent operations. The optimal balanced binary tree would reduce the chain to log2(N).

As I see it, the trade-off for not dealing with longer sequences is: (1) we have less complexity in the compiler, (2) we avoid unknown compile-time blowup calculating a balanced tree, and (3) we don't need to worry about the increased register pressure required to parallelize longer sequences. It also seems unlikely that we would ever encounter really long strings of dependent ops like that in the wild, but I'm not sure how to verify that speculation. FWIW, I see no perf difference for test-suite running on btver2 (x86-64) with -ffast-math and this patch. 

If this patch looks ok, then I can extend it to cover other associative operations such as fmul, fmax, fmin, integer add, integer mul.

This is a partial fix for:
https://llvm.org/bugs/show_bug.cgi?id=17305

and if extended:
https://llvm.org/bugs/show_bug.cgi?id=21768
https://llvm.org/bugs/show_bug.cgi?id=23116

The issue also came up in:
http://reviews.llvm.org/D8941

http://reviews.llvm.org/D9232

Files:
  lib/CodeGen/SelectionDAG/DAGCombiner.cpp
  test/CodeGen/X86/fp-fast.ll

Index: lib/CodeGen/SelectionDAG/DAGCombiner.cpp
===================================================================

--- lib/CodeGen/SelectionDAG/DAGCombiner.cpp
+++ lib/CodeGen/SelectionDAG/DAGCombiner.cpp
@@ -7652,6 +7652,23 @@
         return DAG.getNode(ISD::FMUL, SDLoc(N), VT,
                            N0.getOperand(0), DAG.getConstantFP(4.0, VT));
     }
+
+    // Canonicalize chains of adds to LHS to simplify the following transform.
+    if (N0.getOpcode() != ISD::FADD && N1.getOpcode() == ISD::FADD)
+      return DAG.getNode(ISD::FADD, SDLoc(N), VT, N1, N0);
+    
+    // Convert a chain of 3 dependent operations into 2 independent operations
+    // and 1 dependent operation:
+    //   (fadd x, (fadd y, (fadd z, w))) -> (fadd (fadd x, y), (fadd z, w))
+    if (N0.getOpcode() == ISD::FADD &&  N0.hasOneUse() &&
+        N1.getOpcode() != ISD::FADD) {
+      SDValue N00 = N0.getOperand(0);
+      SDValue N01 = N0.getOperand(1);
+      if (N00.getOpcode() == ISD::FADD) {
+        SDValue NewAdd = DAG.getNode(ISD::FADD, SDLoc(N), VT, N1, N01);
+        return DAG.getNode(ISD::FADD, SDLoc(N), VT, N00, NewAdd);
+      }
+    }
   } // enable-unsafe-fp-math
 
   // FADD -> FMA combines:
Index: test/CodeGen/X86/fp-fast.ll
===================================================================
--- test/CodeGen/X86/fp-fast.ll
+++ test/CodeGen/X86/fp-fast.ll
@@ -113,3 +113,46 @@
   %t2 = fadd float %a, %t1
   ret float %t2
 }
+
+; Verify that the first two adds are independent; the destination registers
+; are used as source registers for the third add.
+
+define float @reassociate_adds1(float %a, float %b, float %c, float %d) {
+; CHECK-LABEL: reassociate_adds1:
+; CHECK:       # BB#0:
+; CHECK-NEXT:    vaddss {{%xmm[0-9], %xmm[0-9]}}, [[XMM1:%xmm[0-9]]]
+; CHECK-NEXT:    vaddss {{%xmm[0-9], %xmm[0-9]}}, [[XMM2:%xmm[0-9]]]
+; CHECK-NEXT:    vaddss [[XMM2]], [[XMM1]], 
+; CHECK-NEXT:    retq
+  %add0 = fadd float %a, %b
+  %add1 = fadd float %add0, %c
+  %add2 = fadd float %add1, %d
+  ret float %add2
+}
+
+define float @reassociate_adds2(float %a, float %b, float %c, float %d) {
+; CHECK-LABEL: reassociate_adds2:
+; CHECK:       # BB#0:
+; CHECK-NEXT:    vaddss {{%xmm[0-9], %xmm[0-9]}}, [[XMM1:%xmm[0-9]]]
+; CHECK-NEXT:    vaddss {{%xmm[0-9], %xmm[0-9]}}, [[XMM2:%xmm[0-9]]]
+; CHECK-NEXT:    vaddss [[XMM2]], [[XMM1]], 
+; CHECK-NEXT:    retq
+  %add0 = fadd float %a, %b
+  %add1 = fadd float %c, %add0
+  %add2 = fadd float %add1, %d
+  ret float %add2
+}
+
+define float @reassociate_adds3(float %a, float %b, float %c, float %d) {
+; CHECK-LABEL: reassociate_adds3:
+; CHECK:       # BB#0:
+; CHECK-NEXT:    vaddss {{%xmm[0-9], %xmm[0-9]}}, [[XMM1:%xmm[0-9]]]
+; CHECK-NEXT:    vaddss {{%xmm[0-9], %xmm[0-9]}}, [[XMM2:%xmm[0-9]]]
+; CHECK-NEXT:    vaddss [[XMM2]], [[XMM1]], 
+; CHECK-NEXT:    retq
+  %add0 = fadd float %a, %b
+  %add1 = fadd float %add0, %c
+  %add2 = fadd float %d, %add1
+  ret float %add2
+}
+

EMAIL PREFERENCES
  http://reviews.llvm.org/settings/panel/emailpreferences/
-------------- next part --------------
A non-text attachment was scrubbed...
Name: D9232.24326.patch
Type: text/x-patch
Size: 2957 bytes
Desc: not available
URL: <http://lists.llvm.org/pipermail/llvm-commits/attachments/20150423/12d5b74d/attachment.bin>