[PATCH] D54095: [X86] X86DAGToDAGISel::matchBitExtract(): extract 'lshr' from `X`

Tue Nov 13 07:10:47 PST 2018

andreadb accepted this revision.
andreadb added a comment.
This revision is now accepted and ready to land.

In https://reviews.llvm.org/D54095#1296822, @andreadb wrote:

> In general, I tend to prefer a sequence of MOVZ + OR  over a partial register write introduced by a byte move.
>  That being said, I don't have a strong opinion on this.
>
> AMD processors don't tend to rename portions of a 32/64 bit GPR. So, the partial write introduced by your byte move is not renamed, and there is a false dependency on the input register operand. On the other hand, if you use a MOVZ, you end up with one extra instruction to dispatch/execute. However, in practice the latency of that extra MOVZ can be hidden, as you can see from the throughput numbers generated by llvm-mca for your examples.
>
> Partial register moves may no be a problem on modern Intel cpus.
>  Intel processors know how to rename portions of a GPR, so false dependencies can be removed that way. According to Agner: since Haswell, `the processor solves this problem without any visible performance penalties. Perhaps it makes dual bookkeeping of both the partial register and the full register.` (cit. "Haswell and Broadwell pipeline" and "Skylake pipeline").
>
> There is still a small penalty for SandyBridge/Ivybridge. On those processors, a partial write can be renamed. However, when the full register is read later on, a merge opcode is issued to join the data from the different writes. That being said, the latency of that merge opcode should be relatively small.
>
> Note however that, according to Agner (see chapter "Sandybridge and Ivybridge pipeline") `a zero-extending move from an 8-bit register to a 32-bit or 64-bit register can also be eliminated` at register renaming stage. That seems to be only true for Sandybridge and Ivybridge. As far as I am aware of, no AMD processor allows zero extending move elimination.
>  That being said, a solution that uses a  MOVZ + OR may be better (or equivalent) on Sandybridge/Ivybridge.

Acutally. BMI1/BMI2 has been introduced since Haswell (thanks Simon for pointing it out). So, the information about Sandybridge/Ivybridge is not relevant to this patch.

In conclusion:
I think your patch is fine, and it removes a not-so-nice dependency from physreg %CL.
An alternative approach that uses MOVZ+OR instead of a MOV8rr is not bad. In particular, if the compiler knows how to optimize out the MOVZ by using the "known-zero-bits", then we may want to produce an OR instead of a partial register move.

-Andrea

> Regardless of the solution that you decide to implement in the end. It is nice to see that we no longer have the regalloc constraint on physical register %CL introduced by the shift right.
> 
> -Andrea

Repository:
  rL LLVM

https://reviews.llvm.org/D54095