[PATCH] D11428: Improve EmitLoweredSelect for contiguous pseudo CMOV instructions.

Kevin B. Smith kevin.b.smith at intel.com
Fri Jul 31 12:14:26 PDT 2015


kbsmith1 added a comment.

Thanks for the additional comments.  See some explanations inline.  I'll see if I can improve a few of the tests to check operand orders, but that isn't generally doable for all the tests because of how downstream passes affect the generated code.


================
Comment at: lib/Target/X86/X86ISelLowering.cpp:20074-20075
@@ +20073,4 @@
+    // Add this PHI to the rewrite table.
+    RegRewriteTable[DestReg].first = Op1Reg;
+    RegRewriteTable[DestReg].second = Op2Reg;
+  }
----------------
ab wrote:
> Merge both lines into
> 
>     RegRewriteTable[DestReg] = std::make_pair(Op1Reg, Op2Reg)
> 
> ?
OK.

================
Comment at: test/CodeGen/X86/pseudo_cmov_lower.ll:2-3
@@ +1,4 @@
+; RUN: llc < %s -mcpu=pentium -mtriple=i386-linux-gnu -o - | FileCheck %s 
+; RUN: llc < %s -mcpu=pentium4 -mattr=-cmov -mtriple=i386-linux-gnu -o - | FileCheck %s 
+
+; This test checks that only a single js gets generated in the final code
----------------
ab wrote:
> > This specific test is meant to test the 32 bit CPU case, as that is where most of the CMOV pseudos occur. For this specific test, the few FP things in here need to use 32 bit cpu to test RFP register type CMOV pseudos.
> 
> Makes sense
> 
> > pseudo_cmov_lower1.ll tests the SSE/SSE2 type pseudo CMOVs (for 32 bit cpu) by using -mcpu=pentium, which I thought would be nice to explicitly say that this is a CPU without CMOV support. I also wanted to make sure that the test worked properly when a CPU with CMOV support was used, but when cmov was turned off explicitly.
> 
> That sounds like testing -mattr and CPU features, which should be done elsewhere, no?  Here, if all you want is to assert that we're compiling for a target without CMOV, -mattr=-cmov does exactly that (but could be left out with just i386).  IMO, explicit CPUs should only be used when the CPU actually matters (e.g. when testing scheduling models?).
> 
> > I can add a x86_64 test based on pseudo_cmov_lower1.ll, or perhaps just as another run line in that test.
> 
> Great!
> 
> > I don't understand why x86_64 will make the tests any more readable, can you elaborate on that?
> 
> Only because of the less noisy ABI, really.  For tests like foo4/foo5, with very explicit checks.
OK, I'll just use i386-linux-gnu rather than the -mcpu options.

================
Comment at: test/CodeGen/X86/pseudo_cmov_lower.ll:8
@@ +7,3 @@
+; CHECK: js
+; CHECK-NOT: js
+define i32 @foo1(i32 %v1, i32 %v2, i32 %v3) nounwind {
----------------
ab wrote:
> Right now the tests check for a specific number of branches.  I'm also interested in the flow of data and register assignments, leading to the final return value.  Basically, the various MOVs in each block, and the final SUB.
> 
> Consider, for instance, a bug where one of the PHIs has inverted operands (or even the feature where opposite CCs lead to inverted PHI operands): we should test for that.
> 
> By the way, I was testing this out when I noticed that this:
> 
> 
> ```
> define i32 @foo(i32 %v0, i32 %v1, i32 %v2, i32 %v3, i32 %v4) nounwind {
> entry:
>   %cmp = icmp slt i32 %v0, 0  ;; <- %v0 disables the opt,  %v1 enables it
>   %v3.v4 = select i1 %cmp, i32 %v3, i32 %v4
>   %v1.v2 = select i1 %cmp, i32 %v1, i32 %v2
>   %sub = sub i32 %v1.v2, %v3.v4
>   ret i32 %sub
> }
> ```
> 
> seems to not trigger the optimization (I get JS twice).  Shouldn't it?
Regarding the first part: it is very difficult to test that the PHI operands are in the correct order (well, really that they come from the proper BB).  The code seen effectively after the transform looks like:

BB1: cmp
     jns BB3
BB2: // empty
BB3: phi op1(from BB1), op2(from BB2)
     phi op11(from BB1), op12(from BB2)

The actual assembly generated for the MOVs is fairly tricky, and very dependent on the whims of register allocation.  I didn't put more specific tests for operands in, because that would tend to make the tests very brittle as downstream passes are changed.  In fact, depending on the moves that needed to be inserted, downstream passes can effectively introduce an else block by making BB2 jmp to BB3 and retargeting the JNS to the else block.  So the order of the operands in the later instructions is very dependent on downstream passes, and not on this change itself.

Now to the second part of the comment.  Yes, it looks like this should have hit the optimization, but it doesn't, for an interesting reason: a select of two memory operands ends up being lowered into a select of two LEAs that represent the addresses of the two memory operands, followed by a single dereference of the selected pointer.  This means you don't have to speculate a load, or increase the number of loads in the program.  The actual machine IR for the example in question, by the time it reaches EmitLoweredSelect, is this:

(gdb) print BB->dump()
BB#0: derived from LLVM BB %entry
        CMP32mi8 <fi#-1>, 1, %noreg, 0, %noreg, 0, %EFLAGS<imp-def>; mem:LD4[FixedStack-1](align=16)
        %vreg0<def> = LEA32r <fi#-4>, 1, %noreg, 0, %noreg; GR32:%vreg0
        %vreg1<def> = LEA32r <fi#-5>, 1, %noreg, 0, %noreg; GR32:%vreg1
        %vreg2<def> = CMOV_GR32 %vreg1<kill>, %vreg0<kill>, 15, %EFLAGS<imp-use> ;GR32:%vreg2,%vreg1,%vreg0
        %vreg3<def> = LEA32r <fi#-2>, 1, %noreg, 0, %noreg; GR32:%vreg3
        %vreg4<def> = LEA32r <fi#-3>, 1, %noreg, 0, %noreg; GR32:%vreg4
        %vreg5<def> = CMOV_GR32 %vreg4<kill>, %vreg3<kill>, 15, %EFLAGS<imp-use>; GR32:%vreg5,%vreg4,%vreg3
        %vreg6<def> = MOV32rm %vreg5<kill>, 1, %noreg, 0, %noreg; mem:LD4[<unknown>] GR32:%vreg6,%vreg5
        %vreg7<def,tied1> = SUB32rm %vreg6<tied0>, %vreg2<kill>, 1, %noreg, 0, %noreg, %EFLAGS<imp-def,dead>; mem:LD4[<unknown>] GR32:%vreg7,%vreg6,%vreg2
        %EAX<def> = COPY %vreg7; GR32:%vreg7
        RETL %EAX
$1 = void

As you can see, the pseudo CMOVs are not contiguous, and so this new code doesn't apply.



http://reviews.llvm.org/D11428

More information about the llvm-commits mailing list