[PATCH] D54095: [X86] X86DAGToDAGISel::matchBitExtract(): extract 'lshr' from `X`

Tue Nov 13 05:41:39 PST 2018

andreadb added a comment.

In general, I tend to prefer a sequence of MOVZ + OR  over a partial register write introduced by a byte move.
That being said, I don't have a strong opinion on this.

AMD processors don't tend to rename portions of a 32/64 bit GPR. So, the partial write introduced by your byte move is not renamed, and there is a false dependency on the input register operand. On the other hand, if you use a MOVZ, you end up with one extra instruction to dispatch/execute. However, in practice the latency of that extra MOVZ can be hidden, as you can see from the throughput numbers generated by llvm-mca for your examples.

Partial register moves may no be a problem on modern Intel cpus.
Intel processors know how to rename portions of a GPR, so false dependencies can be removed that way. According to Agner: since Haswell, `the processor solves this problem without any visible performance penalties. Perhaps it makes dual bookkeeping of both the partial register and the full register.` (cit. "Haswell and Broadwell pipeline" and "Skylake pipeline").

There is still a small penalty for SandyBridge/Ivybridge. On those processors, a partial write can be renamed. However, when the full register is read later on, a merge opcode is issued to join the data from the different writes. That being said, the latency of that merge opcode should be relatively small.

Note however that, according to Agner (see chapter "Sandybridge and Ivybridge pipeline") `a zero-extending move from an 8-bit register to a 32-bit or 64-bit register can also be eliminated` at register renaming stage. That seems to be only true for Sandybridge and Ivybridge. As far as I am aware of, no AMD processor allows zero extending move elimination.
That being said, a solution that uses a  MOVZ + OR may be better (or equivalent) on Sandybridge/Ivybridge.

As a side note: if we can "prove" that the upper bits of the register in input to the MOVZ is known to be zero, then the MOVZ becomes redundant and can be removed. That is another point in favor of MOVZ + OR. But - as I wrote - there is no clear winner.

Regardless of the solution that you decide to implement in the end. It is nice to see that we no longer have the regalloc constraint on physical register %CL introduced by the shift right.

-Andrea

Repository:
  rL LLVM

https://reviews.llvm.org/D54095