[PATCH] D21774: [X86] Transform setcc + movzbl into xorl + setcc

Thu Jun 30 07:16:26 PDT 2016

ab added a comment.

Eh, if Dave has no objections, I'm also fine with the pass;  I was hoping we'd find a better solution, sorry for the noise!

The pseudo does seem like the least brittle approach; but the pre-RA pass seems good enough?
So... have you considered doing it post-RA?  Seems like that lets you avoid the PCMPESTR problem, and I can't think of obvious drawbacks.  It's a tad trickier but should be very close to this patch.

In http://reviews.llvm.org/D21774#469513, @mkuper wrote:

> the call to Select worries me (is it strictly necessary? We're supposed to always select bottom-up, right?)

You're right, I don't think it's necessary.

> It does however expose some of the weirdness Dave was talking about, simply by virtue of happening early and thus having less predictable code at RA time.

>  For CodeGen/X86/legalize-shift-64.ll we will get:

> 

>   ...

>   	movb	$32, %cl

>   	testb	%cl, %cl

>   	sete	%bl

>   	movl	$0, %ecx

>   	movb	%bl, %cl

>   ...

> 

> 

> The block the sete lives in happens to have two different MOV32r0 instructions, one for the sete, and another for an unrelated reason. One of them gets MachineCSE'd, and then regalloc has to remat a 0, and happens to remat it like this.

For the curious: I investigated this some more.  We do have the ability to remat using MOV32r0, but we don't because EFLAGS is live (I swear that liveness up-and-down loop is duplicated, like, a dozen times).  Immediately after that code, there's a jne, so EFLAGS really is live across the rematerialization point, which doesn't sound common.

But that's a separate issue from the INSERT_SUBREG becoming a real copy.  We looked into it with Matthias: on 32-bit, only the ABCD registers are available as sub_8bit, so the INSERT_SUBREG ends up being constrained:

    MOV32mi <fi#0>, 1, %noreg, 0, %noreg, 1; mem:ST4[%x]
    MOV32mi <fi#1>, 1, %noreg, 4, %noreg, 0; mem:ST4[%t+4]
    MOV32mi <fi#1>, 1, %noreg, 0, %noreg, 1; mem:ST4[%t](align=8)
    %vreg0<def> = MOV32ri 1; GR32:%vreg0
    %vreg1<def> = MOV32r0 %EFLAGS<imp-def,dead>; GR32:%vreg1
    %vreg2<def,tied1> = SHLD32rri8 %vreg1<tied0>, %vreg0, 32, %EFLAGS<imp-def,dead>; GR32:%vreg2,%vreg1,%vreg0
    %vreg3<def> = MOV32r0 %EFLAGS<imp-def,dead>; GR32:%vreg3
    %vreg4<def> = MOV8ri 32; GR8:%vreg4
    TEST8rr %vreg4, %vreg4, %EFLAGS<imp-def>; GR8:%vreg4
    %vreg5<def> = SETEr %EFLAGS<imp-use>; GR8:%vreg5
  ## GR32_ABCD:
    %vreg6<def,tied1> = INSERT_SUBREG %vreg3<tied0>, %vreg5<kill>, sub_8bit; GR32_ABCD:%vreg6 GR32:%vreg3 GR8:%vreg5
    %vreg7<def> = CMOV_GR32 %vreg2<kill>, %vreg0, 9, %EFLAGS<imp-use>; GR32:%vreg7,%vreg2,%vreg0
    %vreg8<def,tied1> = XOR32ri8 %vreg7<tied0>, 1, %EFLAGS<imp-def,dead>; GR32:%vreg8,%vreg7
    %vreg9<def,tied1> = OR32rr %vreg6<tied0>, %vreg8<kill>, %EFLAGS<imp-def>; GR32:%vreg9,%vreg8 GR32_ABCD:%vreg6
    JE_1 <BB#2>, %EFLAGS<imp-use>
    JMP_1 <BB#1>

Once the MOV32r0 is CSE'd, we end up not reusing the (unconstrained GR32) vreg, because of that constraint; we end up with a copy after 2-addr:

    MOV32mi <fi#0>, 1, %noreg, 0, %noreg, 1; mem:ST4[%x]
    MOV32mi <fi#1>, 1, %noreg, 4, %noreg, 0; mem:ST4[%t+4]
    MOV32mi <fi#1>, 1, %noreg, 0, %noreg, 1; mem:ST4[%t](align=8)
    %vreg0<def> = MOV32ri 1; GR32:%vreg0
    %vreg1<def> = MOV32r0 %EFLAGS<imp-def,dead>; GR32:%vreg1
    %vreg4<def> = MOV8ri 32; GR8:%vreg4
    TEST8rr %vreg4<kill>, %vreg4, %EFLAGS<imp-def>; GR8:%vreg4
    %vreg5<def> = SETEr %EFLAGS<imp-use>; GR8:%vreg5
  ## COPY to GR32_ABCD:
    %vreg6<def> = COPY %vreg1; GR32_ABCD:%vreg6 GR32:%vreg1
    %vreg6:sub_8bit<def> = COPY %vreg5<kill>; GR32_ABCD:%vreg6 GR8:%vreg5
    JE_1 <BB#1>, %EFLAGS<imp-use,kill>

And that's how we get the worst-case sequence.
So, we thought it was fine if it was more likely only on 32-bit, but now we have one 64-bit example in PR28146.  I'd be interested to investigate that too, but that's only for curiosity; let's forget about the ISel approach.

http://reviews.llvm.org/D21774