[PATCH] D28786: [ExecutionDepsFix] Kill clearance at function entry/calls

Keno Fischer via Phabricator via llvm-commits llvm-commits at lists.llvm.org
Wed Feb 15 13:12:22 PST 2017


loladiro added inline comments.


================
Comment at: test/CodeGen/X86/avx512-cvt.ll:19
 ; KNL-NEXT:    vpextrq $1, %xmm1, %rax
-; KNL-NEXT:    vcvtsi2sdq %rax, %xmm2, %xmm2
+; KNL-NEXT:    vxorps %xmm4, %xmm4, %xmm4
+; KNL-NEXT:    vcvtsi2sdq %rax, %xmm4, %xmm2
----------------
myatsina wrote:
> loladiro wrote:
> > myatsina wrote:
> > > I think the checks in this test mix some of your separate changes (the "XOR" fusion?).
> > > Are you sure this test is correct? Shouldn't it have 2 xors? Also, I'm not sure why xmm2, xmm3 changed to xmm4 here.
> > Yes, this relies on the optimization in D28915. With this patch but without that one, it would have inserted three xorps here (one for each of xmm2, xmm3, xmm4). The optimization in that patch allows us to get away with one.
> > 
> > The reason this changed in this patch rather than the other one is that before this change, xmm2 was considered to have sufficient clearance (same for xmm3 and xmm4), but now it no longer does.
> Why doesn't xmm2 have enough clearance while xmm4 does?
> There is no loop here; xmm4 and xmm2 should have the same clearance at this point.
> Or is this interacting with the other optimization you did that merges a bunch of instructions and rewrites their undef register?
> 
> This is why I want to keep all these changes you are making as separate from each other as possible.
> I understand they are ALL needed to avoid the regressions, but they have complex interactions with each other, therefore I think it's crucial to do them separately and incrementally. It will also help "document" the different parts of this mechanism.
Without the change in D28915, this diff would have been:
```
 ; KNL-NEXT:    vextracti32x4 $3, %zmm0, %xmm1
 ; KNL-NEXT:    vpextrq $1, %xmm1, %rax
+; KNL-NEXT:    vxorps %xmm2, %xmm2, %xmm2
 ; KNL-NEXT:    vcvtsi2sdq %rax, %xmm2, %xmm2
 ; KNL-NEXT:    vmovq %xmm1, %rax
+; KNL-NEXT:    vxorps %xmm3, %xmm3, %xmm3
 ; KNL-NEXT:    vcvtsi2sdq %rax, %xmm3, %xmm1
 ; KNL-NEXT:    vunpcklpd {{.*#+}} xmm1 = xmm1[0],xmm2[0]
 ; KNL-NEXT:    vextracti32x4 $2, %zmm0, %xmm2
 ; KNL-NEXT:    vpextrq $1, %xmm2, %rax
+; KNL-NEXT:    vxorps %xmm3, %xmm3, %xmm3
 ; KNL-NEXT:    vcvtsi2sdq %rax, %xmm3, %xmm3
 ; KNL-NEXT:    vmovq %xmm2, %rax
+; KNL-NEXT:    vxorps %xmm4, %xmm4, %xmm4
 ; KNL-NEXT:    vcvtsi2sdq %rax, %xmm4, %xmm2
 ; KNL-NEXT:    vunpcklpd {{.*#+}} xmm2 = xmm2[0],xmm3[0]
 ; KNL-NEXT:    vinsertf128 $1, %xmm1, %ymm2, %ymm1
 ; KNL-NEXT:    vextracti32x4 $1, %zmm0, %xmm2
 ; KNL-NEXT:    vpextrq $1, %xmm2, %rax
+; KNL-NEXT:    vxorps %xmm4, %xmm4, %xmm4
 ; KNL-NEXT:    vcvtsi2sdq %rax, %xmm4, %xmm3
 ; KNL-NEXT:    vmovq %xmm2, %rax
+; KNL-NEXT:    vxorps %xmm4, %xmm4, %xmm4
 ; KNL-NEXT:    vcvtsi2sdq %rax, %xmm4, %xmm2
 ; KNL-NEXT:    vunpcklpd {{.*#+}} xmm2 = xmm2[0],xmm3[0]
 ; KNL-NEXT:    vpextrq $1, %xmm0, %rax
+; KNL-NEXT:    vxorps %xmm4, %xmm4, %xmm4
 ; KNL-NEXT:    vcvtsi2sdq %rax, %xmm4, %xmm3
 ; KNL-NEXT:    vmovq %xmm0, %rax
+; KNL-NEXT:    vxorps %xmm4, %xmm4, %xmm4
 ; KNL-NEXT:    vcvtsi2sdq %rax, %xmm4, %xmm0
 ; KNL-NEXT:    vunpcklpd {{.*#+}} xmm0 = xmm0[0],xmm3[0]
 ; KNL-NEXT:    vinsertf128 $1, %xmm2, %ymm0, %ymm0
 ; KNL-NEXT:    vinsertf64x4 $1, %ymm1, %zmm0, %zmm0
 ; KNL-NEXT:    retq
```
The optimization in D28915 allows it to recognize that it only needs one vxorps.


================
Comment at: test/CodeGen/X86/break-false-dep.ll:269
+;AVX-NEXT: vucomiss [[XMM4_7]], %xmm0
+  %0 = fcmp ult float %z, 0.0
+  br i1 %0, label %loop, label %fake
----------------
myatsina wrote:
> loladiro wrote:
> > myatsina wrote:
> > > Why the change? What are you testing here?
> > As remarked in the other revision, this is a trick for making sure that pickBestRegister is still tested.
> Well, you're no longer just testing pickBestReg; you're also testing the "xor fusion", which defeats the purpose of unit tests.
> 
> Why do you think this test didn't test "pickBestReg" after your change?
> But this test does need to be changed to expect a "xor" before the convert (which should be the only effect of your change on this test). Or is there some other way you've affected this test that I'm missing?
Because after this change there is no register with sufficient clearance (all clearances are killed at function entry), there's no "best register" for it to pick - they're all equally bad. This construction (a comparison against the constant 0) forces LLVM to materialize a vxorps, which breaks the dependence, so pickBestReg actually has some work to do.
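To illustrate the shape of that trick, here is a minimal sketch (the function name, arguments, and exact control flow are made up for illustration and are not the actual break-false-dep.ll test):
```
define void @clearance_sketch(i64 %x, float %z, double* %out) {
entry:
  br label %loop

loop:
  ; the convert wants an undef xmm operand with good clearance
  %val = sitofp i64 %x to double
  store double %val, double* %out
  ; comparing against a constant 0.0 forces a vxorps to be materialized,
  ; which breaks the false dependence and gives pickBestRegister a freshly
  ; cleared register to pick
  %cmp = fcmp ult float %z, 0.0
  br i1 %cmp, label %loop, label %done

done:
  ret void
}
```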


================
Comment at: test/CodeGen/X86/break-false-dep.ll:371
+  %idx = phi i32 [0, %entry], [%idx, %loop]
+  %valptr = getelementptr i64, i64* %y, i32 %idx
+  %valptr2 = getelementptr i64, i64* %z, i32 %idx
----------------
myatsina wrote:
> loladiro wrote:
> > myatsina wrote:
> > > A suggestion for a simpler test that tests the call feature:
> > > 
> > > define <4 x double> @callclearance( i64 %val ) {
> > > ; inline asm to force choosing one register
> > >  %val1 = sitofp i64 %val to double
> > >  %val2 = sitofp i64 %val to double
> > >   %2 = call <4 x double> @myfunc(i64 %val, i64 %val1)
> > >  %val3 = sitofp i64 %val to double
> > > ...
> > > }
> > > 
> > > Show that the undef register of the first and the second convert are the same, but the undef register of the third convert is different from them (--> the call instruction changes the clearance and thus another register was chosen).
> > > 
> > What kind of inline asm do you have in mind for this purpose? I tried for a bit to come up with something, but didn't manage to.
> The inline asm itself can be a nop or an empty string; it is treated as a black box by most of the compiler, and nobody actually analyzes the registers that are used there. The important part is to mark the relevant registers as clobbered by the asm call (equivalent to the clobber list in extended inline asm):
> tail call void asm sideeffect "", "~{xmm8},~{xmm9},~{xmm10},~{xmm11},~{dirflag},~{fpsr},~{flags}"()
> 
> Here the inline asm is "" and it marks xmm8-xmm11 as clobbered, so this instruction will mark these registers as defs for clearance calculation purposes.
> 
Yes, but that's the opposite of what we need. We'd need some inline assembly to revive clearances (because they're now dead-by-default), not kill them.
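For reference, a self-contained version of the test shape suggested above (minus the inline asm, for the reason just given) might look roughly like this; @myfunc's signature and the value names are illustrative only, not taken from the patch:
```
declare double @myfunc(double, double)

define double @callclearance(i64 %val) {
entry:
  ; per the suggestion, the first and second converts should end up
  ; picking the same undef register
  %val1 = sitofp i64 %val to double
  %val2 = sitofp i64 %val to double
  ; the call kills all clearances...
  %res = call double @myfunc(double %val1, double %val2)
  ; ...so the convert after it should not reuse a register that only had
  ; clearance before the call
  %val3 = sitofp i64 %val to double
  %sum = fadd double %res, %val3
  ret double %sum
}
```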


================
Comment at: test/CodeGen/X86/i64-to-float.ll:281
 ; X64-AVX-NEXT:    vpextrq $1, %xmm0, %rax
-; X64-AVX-NEXT:    vcvtsi2sdq %rax, %xmm3, %xmm1
+; X64-AVX-NEXT:    vxorps %xmm2, %xmm2, %xmm2
+; X64-AVX-NEXT:    vcvtsi2sdq %rax, %xmm2, %xmm1
----------------
myatsina wrote:
> loladiro wrote:
> > myatsina wrote:
> > > Why the change to xmm2?
> > > There is a dependency on it "really close", 3 instructions above it (vpcmpgtq %xmm1, %xmm0, %xmm2).
> > Well, there needs to be an xor with some register here. xmm2 seems as good as any other, unless the fact that it was used shortly before actually makes a difference (I'm not aware of such an impact, since this should be handled in the register renaming unit, but happy to be corrected).
> What I meant in my comment is that the original choice of xmm3 seems better than the new choice of xmm2.
> In theory, if you prefer a far-away register, you may find one that is far enough away that you wouldn't have to insert a xor at all, and thereby save an instruction.
> 
Why does xmm3 seem better? At this point the compiler has already determined that no register has sufficient clearance, which is why it inserts the dependency break. As far as I know, which register it uses then doesn't matter.


Repository:
  rL LLVM

https://reviews.llvm.org/D28786




