[PATCH] D11428: Improve EmitLoweredSelect for contiguous pseudo CMOV instructions.

Ahmed Bougacha ahmed.bougacha at gmail.com
Fri Jul 31 12:53:56 PDT 2015


ab added inline comments.

================
Comment at: test/CodeGen/X86/pseudo_cmov_lower.ll:8
@@ +7,3 @@
+; CHECK: js
+; CHECK-NOT: js
+define i32 @foo1(i32 %v1, i32 %v2, i32 %v3) nounwind {
----------------
kbsmith1 wrote:
> ab wrote:
> > Right now the tests check for a specific number of branches.  I'm also interested in the flow of data and register assignments, leading to the final return value.  Basically, the various MOVs in each block, and the final SUB.
> > 
> > Consider, for instance, a bug where one of the PHIs has inverted operands (or even the feature where opposite CCs lead to inverted PHI operands): we should test for that.
> > 
> > By the way, I was testing this out when I noticed that this:
> > 
> > 
> > ```
> > define i32 @foo(i32 %v0, i32 %v1, i32 %v2, i32 %v3, i32 %v4) nounwind {
> > entry:
> >   %cmp = icmp slt i32 %v0, 0  ;; <- %v0 disables the opt,  %v1 enables it
> >   %v3.v4 = select i1 %cmp, i32 %v3, i32 %v4
> >   %v1.v2 = select i1 %cmp, i32 %v1, i32 %v2
> >   %sub = sub i32 %v1.v2, %v3.v4
> >   ret i32 %sub
> > }
> > ```
> > 
> > seems to not trigger the optimization (I get JS twice).  Shouldn't it?
> Regarding the first part. It is very difficult to test that the PHIs' operands are in the correct order (well, really, that they come from the proper BB). The code seen after the transform effectively looks like:
> 
> BB1: cmp
>         jns BB3
> BB2: // empty
> BB3: phi op1(from BB1), op2(from BB2)
>          phi op11(from BB1), op12(from BB2)
> 
> The actual assembly generated for the MOVs is fairly tricky, and very dependent on the whims of register allocation. I didn't put more specific tests for operands in, because that would tend to make the tests very brittle as downstream passes are changed. In fact, depending on the moves that need to be inserted, downstream passes can effectively introduce an else block, by making BB2 jmp to BB3 and retargeting the JNS to the else block. So the order of the operands in the later instructions generated depends heavily on downstream passes, and not on this change itself.
> 
> Now to the second part of the comment. Yes, it looks like this should have hit the optimization, but it doesn't, for an interesting reason: a select of two memory operands ends up being lowered into a select of two LEAs that represent the addresses of the two memory operands, followed by a single dereference of the selected pointer. This means you don't have to speculate a load, or increase the number of loads in the program. The actual machine IR for the example in question, when you get to EmitLoweredSelect, is this:
> 
> (gdb) print BB->dump()
> BB#0: derived from LLVM BB %entry
>         CMP32mi8 <fi#-1>, 1, %noreg, 0, %noreg, 0, %EFLAGS<imp-def>; mem:LD4[FixedStack-1](align=16)
>         %vreg0<def> = LEA32r <fi#-4>, 1, %noreg, 0, %noreg; GR32:%vreg0
>         %vreg1<def> = LEA32r <fi#-5>, 1, %noreg, 0, %noreg; GR32:%vreg1
>         %vreg2<def> = CMOV_GR32 %vreg1<kill>, %vreg0<kill>, 15, %EFLAGS<imp-use> ;GR32:%vreg2,%vreg1,%vreg0
>         %vreg3<def> = LEA32r <fi#-2>, 1, %noreg, 0, %noreg; GR32:%vreg3
>         %vreg4<def> = LEA32r <fi#-3>, 1, %noreg, 0, %noreg; GR32:%vreg4
>         %vreg5<def> = CMOV_GR32 %vreg4<kill>, %vreg3<kill>, 15, %EFLAGS<imp-use>; GR32:%vreg5,%vreg4,%vreg3
>         %vreg6<def> = MOV32rm %vreg5<kill>, 1, %noreg, 0, %noreg; mem:LD4[<unknown>] GR32:%vreg6,%vreg5
>         %vreg7<def,tied1> = SUB32rm %vreg6<tied0>, %vreg2<kill>, 1, %noreg, 0, %noreg, %EFLAGS<imp-def,dead>; mem:LD4[<unknown>] GR32:%vreg7,%vreg6,%vreg2
>         %EAX<def> = COPY %vreg7; GR32:%vreg7
>         RETL %EAX
> $1 = void
> 
> As you can see, the pseudo CMOVs are not contiguous, and thus this new code doesn't apply.
> 
> The actual assembly generated for the MOVs is fairly tricky, and very dependent on the whims of register allocation. I didn't put more specific tests for operands in, because that would tend to make the tests very brittle as downstream passes are changed. In fact, depending on the moves that need to be inserted, downstream passes can effectively introduce an else block, by making BB2 jmp to BB3 and retargeting the JNS to the else block. So the order of the operands in the later instructions generated depends heavily on downstream passes, and not on this change itself.

I realize that, and this is a shortcoming of our test infrastructure that the MI serialization effort intends to address.  Still, right now, in my opinion, overly explicit tests are better than no tests.

So, my (radical) advice is: use the utils/update_llc_test_checks.py script, and tweak it locally to remove the SP-offset scrubs (l46).  You'll probably want to get rid of foo9, and simplify foo8 (just focus on two types rather than the combination of all?).  Also, I think the reuse of %v2 might make the output harder to follow than it would be if you had %v1.v2 and %v3.v4, but that's just a gut feeling.  When someone changes anything downstream that affects this, they can just run the script again, and have a nice diff of what changed.

What do you think?  If you can come up with a cleaner way, that'd be great!  But I'm uncomfortable with no testing at all :/
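For concreteness, a script-generated test might look something like this (the register assignments and label names below are invented for illustration; the script would fill in whatever the backend actually emits today):

```
; CHECK-LABEL: foo1:
; CHECK:       js [[SINK:\.LBB[0-9_]+]]
; CHECK:       movl {{%[a-z]+}}, {{%[a-z]+}}
; CHECK:       [[SINK]]:
; CHECK:       subl {{%[a-z]+}}, %eax
```

That pins down the single branch, the sink block, and the dataflow into the final SUB, while staying mechanically regenerable.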

> Now to the second part of the comment. Yes, it looks like this should have hit the optimization, but it doesn't for an interesting reason. [...]

Thanks for investigating the example!
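Concretely, what you describe is roughly this shape (a sketch, not the verbatim lowering):

```
%addr = select i1 %cmp, i32* %p1, i32* %p2  ; becomes LEA + LEA + CMOV pseudo
%val = load i32, i32* %addr                 ; single load, nothing speculated
```

so the LEAs interleave between the CMOV pseudos and break the contiguity this patch relies on.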


http://reviews.llvm.org/D11428


More information about the llvm-commits mailing list