[llvm] r195496 - X86: Perform integer comparisons at i32 or larger.

Andrew Trick atrick at apple.com
Sun Dec 29 14:53:36 PST 2013


On Dec 28, 2013, at 6:36 PM, Owen Anderson <resistor at mac.com> wrote:

> Hi Chandler,
> 
> On Dec 28, 2013, at 3:39 PM, Chandler Carruth <chandlerc at google.com> wrote:
> 
>> But because we read it with the 'b' suffix, we don't actually read the high bytes! So we shouldn't be seeing a stall on the read side. It seems like (at least on Sandy Bridge) there is merging of partial registers on *write* even without a read, so the movb alone would be enough to cause the slowdown regardless of how we read it. This at least explains the measured performance shifts I see on my Sandy Bridge machine. When I take the original assembly for all of bzip2 and modify just these movb instructions to be movzbl, I see roughly the same performance uplift as we get with ToT and Jim's patch.
> 
> 
> I think you’ve found something interesting here, but I think your analysis of what’s going on architecturally isn’t quite accurate.  The Intel optimization manual notes that modern architectures incur only very small partial register stalls.  What’s more likely is that the partial register write is causing a serialization of the out-of-order core.  Consider some assembly code like this:
> 
> addl %r8d, %eax
> ...
> movb (%rcx), %al
> cmpb %al, %bl
> 
> Assuming the machine does renaming at a granularity wider than 8 bits, it cannot issue the load until the add completes, because it needs the values of the high bytes of %eax to be available.  Thus the snippet will take at least three cycles to complete.
> 
> Now consider this version:
> 
> addl %r8d, %eax
> ...
> movzbl (%rcx), %eax
> cmpb %al, %bl
> 
> Here, the result of the load does not depend on the add, and the two can issue in parallel.  The minimal time to execute is only two cycles.  Note that the delta between the good case and the bad case is exacerbated if the preceding dependency is a high-latency operation.  One interesting experiment would be to see whether inserting an xorl before the movb also recovers the performance.
> 
> The optimization you’re looking to do here is very similar to something we did on ARM, where partial register dependencies via the predicate register are a significant problem.  Jakob and Andy designed a solution (based on MachineTraceMetrics, I think?) for that case, where we were able to detect false dependencies on high-latency operations and insert dependency-breaking instructions in between.  I’ve CC’d them in the hope that they can explain a bit about how it works.
> 
> —Owen

To confirm Owen's analysis, I can tell you that bzip2 on x86 appears highly sensitive to register allocation, with a jitter of ~3%. (Although that seems to contradict Agner's statement that Sandy Bridge renames partial registers.) Either way, it would be great to finally fix this.
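
For what it's worth, the shape of such a fix on the X86 side would be roughly the following (just a sketch to show the idea, not the actual patch; the helper name and where it would live in a pass are hypothetical):

  // Sketch only: insert a dependency-breaking xor of the full 32-bit
  // register immediately before an instruction that writes only its low
  // byte (e.g. a MOV8rm feeding a later cmpb).  Both operands are marked
  // 'undef' so the xor does not itself introduce a read of the register.
  #include "X86InstrInfo.h"                      // X86::XOR32rr
  #include "llvm/CodeGen/MachineInstrBuilder.h"  // BuildMI, RegState

  // 'SuperReg' is the GR32 register (e.g. X86::EAX) whose low byte 'MI'
  // is about to clobber; the caller has already decided the dependency
  // is worth breaking.
  static void insertDepBreakingXor(MachineInstr *MI, unsigned SuperReg,
                                   const TargetInstrInfo *TII) {
    BuildMI(*MI->getParent(), MI, MI->getDebugLoc(),
            TII->get(X86::XOR32rr), SuperReg)
        .addReg(SuperReg, RegState::Undef)
        .addReg(SuperReg, RegState::Undef);
  }

This is essentially Owen's xorl experiment done by the compiler instead of by hand.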

Regarding partial-dependence-fixing passes: Jakob recently added an optimization to avoid partial-flag-setting instructions in the Thumb2 size reduction pass (there may be other, older attempts that I don’t remember, perhaps handling S regs?). Being target specific, the partial-CPSR fix just hard-codes a couple of high-latency opcodes. It should be easy to use the generic TargetSchedModel API instead.
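
Something along these lines would do it (a sketch; it assumes the pass has already init()'d a TargetSchedModel, and the latency cutoff is made up):

  #include "llvm/CodeGen/TargetSchedule.h"  // TargetSchedModel

  // Sketch: consult the generic scheduling model rather than a
  // hard-coded opcode list to decide whether the def feeding a
  // partial-register write is expensive enough to bother about.
  static bool isHighLatencyDef(const MachineInstr *DefMI,
                               const TargetSchedModel &SchedModel) {
    const unsigned LatencyThreshold = 4;  // arbitrary cutoff for illustration
    return SchedModel.computeInstrLatency(DefMI) > LatencyThreshold;
  }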

MachineTraceMetrics goes beyond latency and gives you critical path/slack metrics. It can be useful when the tradeoffs are complicated. It might be overkill for a simple heuristic.
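
If the tradeoffs did get complicated enough to want it, the query side looks roughly like this (again just a sketch; the trace strategy and the slack cutoff here are arbitrary):

  #include "llvm/CodeGen/MachineTraceMetrics.h"

  // Sketch: measure an instruction's slack against the critical path of
  // the trace through its block.  Little or no slack means the
  // instruction is on or near the critical path, so a false dependency
  // feeding it is more likely to hurt.
  static bool isNearCriticalPath(const MachineInstr *MI,
                                 MachineBasicBlock *MBB,
                                 MachineTraceMetrics &MTM) {
    MachineTraceMetrics::Ensemble *TE =
        MTM.getEnsemble(MachineTraceMetrics::TS_MinInstrCount);
    MachineTraceMetrics::Trace Trace = TE->getTrace(MBB);
    return Trace.getInstrSlack(MI) < 2;  // arbitrary slack cutoff
  }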

-Andy

