[PATCH] D28915: [ExecutionDepsFix] Optimize instruction insertion

Keno Fischer via Phabricator via llvm-commits llvm-commits at lists.llvm.org
Thu Feb 9 16:17:39 PST 2017


loladiro added a comment.

There's only really two major changes here:

1. Updating the registers at the end
2. isDependencyBreak support

Updating the registers at the end is not redundant with picking the best register,
because it only operates on instructions where we've determined earlier that a
dependency break is required (i.e. there are no suitable undef registers with
sufficient clearance). What updating at the end does is allow us to avoid the pathological
case in the comment I added above:

  // All registers clobbered here, the code determines it needs a dependency break,
  // so arbitrarily picks xmm0 as the undef read and then remembers for later to
  // insert a dependency break here
  vrandom %xmm0<undef>, %xmm0<def>
  vrandom %xmm1<undef>, %xmm1<def>
  vrandom %xmm2<undef>, %xmm2<def>
  vrandom %xmm3<undef>, %xmm3<def>

The problem is that without this patch this turns into:

  vxorps %xmm0, %xmm0, %xmm0
  vrandom %xmm0<undef>, %xmm0<def>
  vxorps %xmm1, %xmm1, %xmm1
  vrandom %xmm1<undef>, %xmm1<def>
  vxorps %xmm2, %xmm2, %xmm2
  vrandom %xmm2<undef>, %xmm2<def>
  vxorps %xmm3, %xmm3, %xmm3
  vrandom %xmm3<undef>, %xmm3<def>

Clearly, we can do better, which is what this patch does. However, doing so involves
a backwards walk over all the undef reads we know need to be broken, so it's fundamentally
incompatible with the forwards algorithm in pickBestRegister.

> Most of these changes where not part of the original patch you've uploaded in your previous patch version that combined the function calls, so I'm wondering how they got themselves into this patch now? Do we really need them? Have you seen performance improvements by these changes too measuring each one separately (even if they look good "on paper" they might not bring the expected result, they might even make it worse)?

I think I had an earlier version for avoiding redundant xorps in a forward manner, which you may have looked at.
That turned out to be insufficient, and is subsumed by this algorithm. In general this
revision is aimed at limiting the impact of https://reviews.llvm.org/D28786, which without this enhancement,
caused a very large number of vxorps to be inserted.

The isDependencyBreak is mostly included because it makes it possible to easily create registers that the algorithm
recognizes as having lots of clearance (and thus not requiring an extra dependency break), even when clearance is
killed at function entry as https://reviews.llvm.org/D28786 does. After that change it is very rare for there to be registers of sufficient clearance otherwise.
I'd be happy to split that out if you want.



================
Comment at: lib/CodeGen/ExecutionDepsFix.cpp:661
+        LowestValid = Reg;
+    } else if (ChoosableRegs.count(Reg) == 1) {
+      if (LowestValid == (unsigned)-1)
----------------
myatsina wrote:
> why do you need this case for?
Only to make sure we actually return the correct LowestValid.


================
Comment at: lib/CodeGen/ExecutionDepsFix.cpp:666
+  }
+  return LowestValid;
+}
----------------
myatsina wrote:
> why do we choose lowest register? why not choose a "far away" register (this way we may even be able to save the xor)?
We want to pick an arbitrary register from the choosable set. Picking the lowest one ensures a consistent choice. Picking a far away one doesn't make sense, because to get to this point, we've already determined that no registers are sufficiently far away.


================
Comment at: lib/CodeGen/ExecutionDepsFix.cpp:690
+  // before every instruction above. However, what we really want is
+  //    vxorps %xmm3, %xmm3, %xmm3
+  //    vrandom %xmm3<undef>, %xmm0<def>
----------------
myatsina wrote:
> Shouldn't "pickBestRegister" catch these cases? And if not, maybe we've missed an optimization case there.
> I'm not sure we should have this in this form.
> What is the scenario you're trying to optimizer and that "pickBestRegister" doesn't catch?
See the non-inline reply. This is fundamentally a reverse operation.


================
Comment at: lib/CodeGen/ExecutionDepsFix.cpp:722
+    for (auto LiveReg : LiveRegSet)
+      ChoosableRegs.erase(LiveReg);
+
----------------
myatsina wrote:
> Aren't you checking  the liverange set in updateChooseableRegs? Why do it here too?
`updateChoosableRegs` only looks at the instructions before which we require dependency breaks.
This kills choosability for any registers that have liveness entirely between two such instructions.


================
Comment at: lib/Target/X86/X86InstrInfo.cpp:7511
+  }
+  if (OutReg)
+    *OutReg = Reg;
----------------
myatsina wrote:
> I see that you write a lot of functions so that they could return 2 arguments and that you've used the same "trick" in other functions as well.
> I don't thing it's a good approach in general. This need of a function to return 2 values generally indicates that you need to extract the common logic somehow or define a new type or do some other form of refactoring refactoring.
> In this case I think "isDependencyBreak" should only return true/false (as the name indicates). If the module asking wants to know which register it is then it can find the dest reg on its own, or we can provide the module a "getDependencyBreakReg". In general it will behave similar to the already existing "isReg()" and "getReg()".
Fair enough (maybe I've been writing too much C code). In general I dislike splitting such logic across two functions since it needs to be kept in sync (e.g. when adding additional dependency breaking instructions to be recognized). How about adding an `getDependencyBreakReg` that just returns 0 or an llvm::Optional, if it's not a dependency breaking instruction. 


================
Comment at: test/CodeGen/X86/break-false-dep.ll:236
 top:
-  tail call void asm sideeffect "", "~{xmm6},~{dirflag},~{fpsr},~{flags}"()
-  tail call void asm sideeffect "", "~{xmm0},~{xmm1},~{xmm2},~{xmm3},~{dirflag},~{fpsr},~{flags}"()
-  tail call void asm sideeffect "", "~{xmm4},~{xmm5},~{xmm7},~{dirflag},~{fpsr},~{flags}"()
+;AVX-LABEL:@clearence
+; This is carefully constructed to force LLVM to materialize a vxorps, which
----------------
myatsina wrote:
> Each feature should have it's own test. You are not supposed to eliminate the original test (unless the original test's logic is no longer true), you're suppose to a small unit test for each feature.
> This test was originally designed to check "pickBestRegister".
It still checks pickBestRegister, but in a way that remains valid after D28786, by making use of the fact that the `fcmp ult double %x, 0.0` materializes a constant `0.0` as a dependency breaking instruction, so pickBestRegister can pick up on that.


================
Comment at: test/CodeGen/X86/break-false-dep.ll:237
+;AVX-LABEL:@clearence
+; This is carefully constructed to force LLVM to materialize a vxorps, which
+; also implicitly breaks the dependency, making it a good candidate for the
----------------
myatsina wrote:
> This test is solely for testing the xor combining you've added? Or is it suppose to test something else?
> I already see you've added a dedicated test for the "xor", so I'm not sure why you've changed this one.
> 
> If it's only for xor combine then you can do a very simple test like this:
> 
> define double @clearence(i64 %arg1, i64 %arg2) {
> // the inline asm calls that make xmm6 as the best choice
> %tmp1 = sitofp i64 %arg1 to double
> %tmp2 = sitofp i64 %arg2 to double
> %tmp3 = add i64 %tmp1, %tmp2
>   ret double %tmp3
> }
> 
> In this case xmm6 should be chosen as the undef for both convert instructions.
> and there should only be one xor
See above


================
Comment at: test/CodeGen/X86/break-false-dep.ll:242
+;AVX: vucomisd  [[XMM1]], %xmm0
+  %0 = fcmp ult double %x, 0.0
+  br i1 %0, label %main, label %fake
----------------
myatsina wrote:
> why did you add this part?
> What is the logic that is tests?
See above


================
Comment at: test/CodeGen/X86/break-false-dep.ll:367
+  ret double %fadd3
+}
----------------
myatsina wrote:
> where are the tests for all the other optimizations you've added?
> The "isBreakingDependency", "register re-picking", etc?
This is the test for register re-picking.


================
Comment at: test/CodeGen/X86/vec_int_to_fp.ll:1668
 ; VEX-NEXT:  # BB#7:
-; VEX-NEXT:    vcvtsi2ssq %rax, %xmm2, %xmm1
+; VEX-NEXT:    vcvtsi2ssq %rax, %xmm1, %xmm1
 ; VEX-NEXT:  .LBB39_8:
----------------
myatsina wrote:
> Why is using xmm1 ok?
> If we fall through from the revius BB where xmm1 is xored, then it's ok but we might jump to this block from another place and if we use xmm1 and won't have a xor, then we will have a stall, no?
> 
As far as I can tell, nothing else jumps in between the vxorps and this undefined use, so it's a good register to use.


https://reviews.llvm.org/D28915





More information about the llvm-commits mailing list