[llvm] r195496 - X86: Perform integer comparisons at i32 or larger.

Sat Dec 28 18:36:43 PST 2013

Hi Chandler,

On Dec 28, 2013, at 3:39 PM, Chandler Carruth <chandlerc at google.com> wrote:

> But because we read it with the 'b' suffix, we don't actually read the high bytes! So we shouldn't be seeing a stall on the read side. So it seems like (at least on sandybridge) there is merging of partial registers on *write* even without a read. Then the movb would be enough to cause the slowdown regardless of how we read it. This at least explains the measured performance shifts I see on my sandybridge. When I take the original assembly for all of bzip2, and I modify just these movb instructions to be movzbl, I see roughly the same performance uplift as we get with ToT and Jim's patch.

I think you’ve found something interesting here, but I think your analysis of what’s going on architecturally isn’t quite accurate.  The Intel optimization manual notes that modern architectures incur only very small partial register stalls.  What’s more likely is that the partial register write is causing a serialization of the out-of-order core.  Consider some assembly code like this:

addl %r8d, %eax
...
movb (%rcx), %al
cmpb %al, %bl

Assuming the machine does register renaming at a granularity coarser than 8 bits, it cannot issue the load until the add completes, because it needs the add's result to supply the high bytes of %eax.  Thus the snippet will take at least three cycles to complete.

Now consider this version:

addl %r8d, %eax
...
movzbl (%rcx), %eax
cmpb %al, %bl

Here, the result of the load does not depend on the add, so the load can be issued in parallel with it.  The minimal time to execute is only two cycles.  Note that the delta between the good case and the bad case is exacerbated if the preceding dependency is a high-latency operation.  One interesting experiment would be to see if inserting an xorl (zeroing %eax) before the movb also recovers the performance.
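
The cycle counts above can be reproduced with a toy dependency model.  This is only a sketch under simplifying assumptions that are not from the manual: every instruction, including the load, has a latency of one cycle unless stated otherwise, issue width is unlimited, and the only thing modeled is whether the load has to wait for the add.  The names cmp_ready_cycle and snippet are made up for illustration.

```python
# Toy model: each instruction is (name, latency, [names of producers it waits on]).
# An instruction starts when all of its producers have finished.

def cmp_ready_cycle(instrs):
    """Return the cycle at which the final compare's result is available."""
    finish = {}
    for name, latency, deps in instrs:
        start = max((finish[d] for d in deps), default=0)
        finish[name] = start + latency
    return finish["cmp"]

def snippet(load_merges_with_add, add_latency=1):
    """Build the add / load / cmp sequence from the message.

    load_merges_with_add=True models movb: the loaded byte must be merged
    with the high bytes of %eax produced by the add, so the load waits.
    load_merges_with_add=False models movzbl: all of %eax is overwritten,
    so the load has no dependency on the add.
    """
    return [
        ("add", add_latency, []),
        ("load", 1, ["add"] if load_merges_with_add else []),
        ("cmp", 1, ["load"]),
    ]

if __name__ == "__main__":
    print(cmp_ready_cycle(snippet(True)), cmp_ready_cycle(snippet(False)))
    # Same comparison when the producer of %eax is a high-latency operation.
    print(cmp_ready_cycle(snippet(True, add_latency=10)),
          cmp_ready_cycle(snippet(False, add_latency=10)))
```

With unit latencies the movb form needs three cycles and the movzbl form two.  Raising add_latency only lengthens the movb form, which is the sense in which a high-latency producer makes the gap worse.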

The optimization you’re looking to do here is very similar to something we did on ARM, where partial register dependencies via the predicate register are a significant problem.  Jakob and Andy designed a solution (based on MachineTraceMetrics, I think?) for that instance where we were able to detect false dependencies on high-latency operations and insert dependency-breaking instructions in between.  I’ve CC’d them in hopes that they can explain a bit how it works.
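
To make the general idea concrete, here is a rough sketch of what "detect a false dependency on a high-latency operation and insert a dependency-breaking instruction" could look like.  It is not the LLVM implementation and does not use MachineTraceMetrics; the instruction representation, the "break" pseudo-op, the function name and the latency threshold are all hypothetical, and the distance between producer and consumer is ignored.

```python
def insert_dependency_breakers(instrs, latency_threshold=3):
    """Insert a hypothetical 'break' op before risky partial writes.

    instrs is a list of dicts with keys:
      'op'      - mnemonic (illustrative only)
      'dst'     - destination register name
      'partial' - True if only part of 'dst' is written
      'latency' - assumed latency in cycles
    A partial write is treated as risky when the most recent full write to
    the same register has latency >= latency_threshold.
    """
    out = []
    last_full_write_latency = {}  # register -> latency of its last full writer
    for ins in instrs:
        dst = ins["dst"]
        risky = (ins["partial"]
                 and last_full_write_latency.get(dst, 0) >= latency_threshold)
        if risky:
            # The break fully overwrites dst, so the partial write that
            # follows no longer has to wait for the slow producer.
            out.append({"op": "break", "dst": dst,
                        "partial": False, "latency": 1})
            last_full_write_latency[dst] = 1
        out.append(ins)
        if not ins["partial"]:
            last_full_write_latency[dst] = ins["latency"]
    return out

if __name__ == "__main__":
    program = [
        {"op": "div",   "dst": "r0", "partial": False, "latency": 20},
        {"op": "loadb", "dst": "r0", "partial": True,  "latency": 1},
        {"op": "add",   "dst": "r1", "partial": False, "latency": 1},
        {"op": "loadb", "dst": "r1", "partial": True,  "latency": 1},
    ]
    print([i["op"] for i in insert_dependency_breakers(program)])
```

In this toy version only the partial write that follows the slow div gets a breaker; the one that follows the cheap add is left alone, since waiting on a one-cycle producer costs about as much as the extra instruction.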

—Owen