[llvm] r195496 - X86: Perform integer comparisons at i32 or larger.

Tue Jan 7 21:02:49 PST 2014

On Jan 6, 2014, at 5:33 PM, Chandler Carruth <chandlerc at google.com> wrote:

> 
> On Sun, Dec 29, 2013 at 5:53 PM, Andrew Trick <atrick at apple.com> wrote:
> To confirm Owen's analysis I can tell you that bzip2 on x86 appears highly sensitive to register allocation with a jitter of ~3%. (Although it seems to contradict Agner's statements that SandyBridge renames partial registers).
> 
> The more measurements I do, the more I come to believe that at least on SandyBridge what is happening is this:
> 
> movb (...,%rax), %al
> 
> We have a partial write and a full read of %rax. Even though this is reading a clobber of %rax, this still ends up paying *some* of the cost of the partial register write, but not all of it. My guess is that this is the opcode for the merge rather than a stall due to a dependency, but I'm just guessing at that point. What seems consistent is that if the destination register of the movb has no full 64-bit (or 32-bit) reads, we don't pay the cost. But again, I say "seems" consistent, but it's hard to tell because testing it involves manually re-allocating registers to achieve that prediction, and that can cause jitter as you say. =/ Performance testing is hard.
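
For anyone following along, here is a rough sketch of why a byte write counts as "partial" at all. It models only the architectural semantics (an 8-bit write keeps the upper bits of the old register value, while a 32-bit write on x86-64 zero-extends), not whatever the SandyBridge renamer actually does. The function names are made up for illustration.

```python
MASK64 = (1 << 64) - 1

def write_low8(old, value):
    """Model `movb ..., %al`: replace bits 0-7 and keep bits 8-63 of the old value."""
    return (old & ~0xFF & MASK64) | (value & 0xFF)

def write_low32(old, value):
    """Model `movl ..., %eax`: the result is zero-extended, so `old` is ignored."""
    return value & 0xFFFFFFFF

rax = 0x1122334455667788
print(hex(write_low8(rax, 0xAB)))   # result still depends on the old %rax
print(hex(write_low32(rax, 0xAB)))  # result is independent of the old %rax
```

Because the 8-bit result is a function of the old register contents, a later full-width read has to see a merged value, which is where the extra cost would come from.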

I suppose someone could run the Linux perf tool and report the CPU counters. There’s one called RAT_STALLS.PARTIAL_CYCLES.
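
A minimal sketch of how such a measurement might be scripted, assuming `perf stat -e EVENT -- COMMAND` is available on the test machine. The event spelling is taken from this message as-is; whether a given perf build accepts it under that name is an assumption that should be checked against `perf list`. The helper name, the bzip2 arguments, and the input file are hypothetical.

```python
import shlex

def perf_stat_argv(event, command):
    """Build an argv list for `perf stat -e EVENT -- COMMAND...` (hypothetical helper)."""
    return ["perf", "stat", "-e", event, "--", *command]

# Event name copied from the email; the benchmark command is a placeholder.
argv = perf_stat_argv("RAT_STALLS.PARTIAL_CYCLES", ["bzip2", "-k", "input.dat"])
print(shlex.join(argv))
```

The list can be handed to `subprocess.run(argv)` once the event name has been confirmed for the CPU in question.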

As you said below, there’s nothing actionable in the compiler. It’s just a microarchitectural oddity that isn’t supposed to happen.

-Andy 

>  
> Either way it would be great to finally fix this.
> 
> Regarding partial dependence fixing passes, Jakob recently added an optimization to avoid partial flag setting instructions in the thumb2 size reduction pass (there may be other, older attempts that I don’t remember, maybe handling S regs?). Being target specific, the partial CPSR fix just hard codes a couple of high latency opcodes. It should be easy to use the generic TargetSchedModel API instead.
> 
> MachineTraceMetrics goes beyond latency and gives you critical path/slack metrics. It can be useful when the tradeoffs are complicated. It might be overkill for a simple heuristic.
> 
> If there is anything to do here, I think this would both be overkill and wouldn't work. Most of these techniques would conclude that nothing needs to be done here because we write 8 bits and then read 8 bits, never reading the merged register. The weird thing is that the chip doesn't seem to catch this.
> 
