[PATCH] D28786: [ExecutionDepsFix] Kill clearance at function entry/calls

Marina Yatsina via Phabricator via llvm-commits llvm-commits at lists.llvm.org
Wed Feb 15 08:02:14 PST 2017


myatsina added inline comments.


================
Comment at: lib/CodeGen/ExecutionDepsFix.cpp:631
     }
+  } else if (MI->isCall()) {
+    // If this is a call, pretend all registers we are considering are def'd
----------------
loladiro wrote:
> myatsina wrote:
> > Here, you've made this version dependent on the D28915 review.
> > I suggest you commit this patch before D28915, but then you need to change this condition a bit (else if --> if).
> > 
> > You can change it back in the "isDependencyBreak" review.
> Without D28915 there would be several hundred more xorps inserted in the tests ;)
> 
Good point :)
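
(Illustration only, not part of the patch or its tests; the function and callee names below are made up. The point under discussion is that a call is now treated as def'ing every register the pass tracks, so a convert after the call sees no clearance and a dependency-breaking xor is expected in front of it, unless the xor reuse from D28915 lets it share an earlier one.)

declare double @ext(double)

define double @convert_around_call(i64 %a, i64 %b) {
entry:
  %c1 = sitofp i64 %a to double       ; convert before the call
  %r = call double @ext(double %c1)   ; call: clearance of all tracked registers is reset here
  %c2 = sitofp i64 %b to double       ; convert after the call: a xor is expected to break the false dependency
  %s = fadd double %r, %c2
  ret double %s
}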


================
Comment at: test/CodeGen/X86/avx512-cvt.ll:19
 ; KNL-NEXT:    vpextrq $1, %xmm1, %rax
-; KNL-NEXT:    vcvtsi2sdq %rax, %xmm2, %xmm2
+; KNL-NEXT:    vxorps %xmm4, %xmm4, %xmm4
+; KNL-NEXT:    vcvtsi2sdq %rax, %xmm4, %xmm2
----------------
loladiro wrote:
> myatsina wrote:
> > I think the checks in this test mix some of your separate changes (the "XOR" fusion?).
> > Are you sure this test is correct? Shouldn't it have 2 xors? Also, I'm not sure why xmm2 and xmm3 changed to xmm4 here.
> Yes, this relies on the optimization in D28915. With this patch but without the other one, it would have inserted three xorps here (one for each of xmm2, xmm3, xmm4). The optimization in that patch allows us to get away with one.
> 
> The reason this changed here and not in the other patch is that before this, xmm2 was considered to have sufficient clearance (same for xmm3 and xmm4), but now it doesn't any more.
Why doesn't xmm2 have enough clearance and xmm4 does?
There is no loop here; xmm4 and xmm2 should have the same clearance at this point.
Or is this an interaction with the other optimization you did, the one that merges a bunch of instructions and rewrites their undef register?

This is why I want to keep all these changes you are making as separate from each other as possible.
I understand they are ALL needed to avoid the regressions, but they have complex interactions with each other, so I think it's crucial to do them separately and incrementally. It will also help document the different parts of this mechanism.


================
Comment at: test/CodeGen/X86/break-false-dep.ll:371
+  %idx = phi i32 [0, %entry], [%idx, %loop]
+  %valptr = getelementptr i64, i64* %y, i32 %idx
+  %valptr2 = getelementptr i64, i64* %z, i32 %idx
----------------
loladiro wrote:
> myatsina wrote:
> > A suggestion for a simpler test that tests the call feature:
> > 
> > define <4 x double> @callclearance( i64 %val ) {
> > ; inline asm to force choosing one register
> >  %val1 = sitofp i64 %val to double
> >  %val2 = sitofp i64 %val to double
> >  %res = call <4 x double> @myfunc(i64 %val, double %val1)
> >  %val3 = sitofp i64 %val to double
> > ...
> > }
> > 
> > Show that the undef register of the first and the second convert are the same, but the undef register of the third convert is different from them (--> the call instruction changes the clearance and thus another register was chosen).
> > 
> What kind of inline asm do you have in mind for this purpose? I tried for a bit to come up with something, but didn't manage to.
The inline asm itself can be a nop or an empty string; it is treated as a black box by most of the compiler, and nobody actually analyzes the registers used in it. The important part is to mark the relevant registers as clobbered by the asm call (equivalent to the clobber list in extended inline asm):
tail call void asm sideeffect "", "~{xmm8},~{xmm9},~{xmm10},~{xmm11},~{dirflag},~{fpsr},~{flags}"()

Here the inline asm is "" and it marks xmm8-xmm11 as clobbered, so this instruction will mark these registers as defs for clearance calculation purposes.
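
(Putting the two pieces together, a sketch of the suggested test shape, illustrative only and not the actual test in break-false-dep.ll; @myfunc is a made-up external callee and no specific registers are asserted.)

declare <4 x double> @myfunc(i64, double)

define <4 x double> @callclearance(i64 %val) {
entry:
  ; Clobbering xmm8-xmm11 marks them as recently def'd, which constrains
  ; which registers the pass will pick as the undef operand of the converts.
  tail call void asm sideeffect "", "~{xmm8},~{xmm9},~{xmm10},~{xmm11},~{dirflag},~{fpsr},~{flags}"()
  %val1 = sitofp i64 %val to double
  %val2 = sitofp i64 %val to double
  ; The call resets clearance, so the convert below should end up with a
  ; different undef operand (or a fresh xor) than the two converts above.
  %res = call <4 x double> @myfunc(i64 %val, double %val1)
  %val3 = sitofp i64 %val to double
  %ins1 = insertelement <4 x double> %res, double %val2, i32 0
  %ins2 = insertelement <4 x double> %ins1, double %val3, i32 1
  ret <4 x double> %ins2
}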



================
Comment at: test/CodeGen/X86/i64-to-float.ll:281
 ; X64-AVX-NEXT:    vpextrq $1, %xmm0, %rax
-; X64-AVX-NEXT:    vcvtsi2sdq %rax, %xmm3, %xmm1
+; X64-AVX-NEXT:    vxorps %xmm2, %xmm2, %xmm2
+; X64-AVX-NEXT:    vcvtsi2sdq %rax, %xmm2, %xmm1
----------------
loladiro wrote:
> myatsina wrote:
> > why the change to xmm2?
> > There is a dependency on it "really close", 3 instructions above it (vpcmpgtq %xmm1, %xmm0, %xmm2)
> Well, there needs to be an xor with some register here. xmm2 seems as good as any other, unless the fact that it was used shortly before actually makes a difference (I'm not aware of such an impact, since this should be handled in the register renaming unit, but happy to be corrected).
What I meant in my comment is that the original choice of xmm3 seems better than the new choice of xmm2.
In theory, if you prefer a register whose last def is far away, you may find one with enough clearance that you don't have to insert a xor at all, and thereby save an instruction.



Repository:
  rL LLVM

https://reviews.llvm.org/D28786




