[llvm] r195496 - X86: Perform integer comparisons at i32 or larger.

Mon Jan 6 17:33:47 PST 2014

On Sun, Dec 29, 2013 at 5:53 PM, Andrew Trick <atrick at apple.com> wrote:

> To confirm Owen's analysis I can tell you that bzip2 on x86 appears highly
> sensitive to register allocation with a jitter of ~3%. (Although it seems
> to contradict Agner's statements that SandyBridge renames partial
> registers).
>

The more measurements I do, the more I come to believe that at least on
SandyBridge what is happening is this:

movb (...,%rax), %al

Here we have a partial write (%al) and a full read of %rax in the same
instruction. Even though the register being read is the one this
instruction clobbers, it still ends up paying *some* of the cost of the
partial register write, but not all of it. My guess is that what we pay for
is the merge operation itself rather than a stall due to a dependency, but
I'm just guessing at this point. What seems consistent is that if the
destination register of the movb has no full 64-bit (or 32-bit) reads, we
don't pay the cost. Again, I say "seems" consistent because it's hard to
tell: testing it involves manually re-allocating registers to set up that
condition, and that can cause jitter, as you say. =/ Performance testing is
hard.
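
To make the pattern concrete, here is a minimal sketch of the two forms of
the load. The %rdi base register is a placeholder (the address in the
instruction above is elided), and whether the second form actually avoids
the cost on SandyBridge is an assumption, not something measured in this
thread; it only shows what a load without a partial write looks like.

# Byte load: writes only %al, so bits 8..63 of %rax keep their old value.
movb    (%rdi,%rax), %al

# Zero-extending load: writes all of %eax, and a 32-bit write clears
# bits 32..63 of %rax, so no bits of the old value are kept.
movzbl  (%rdi,%rax), %eax

The second form changes the upper bits of %rax, so it is only a drop-in
replacement when nothing depends on them.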

> Either way it would be great to finally fix this.
>
> Regarding partial dependence fixing passes, Jakob recently added an
> optimization to avoid partial flag setting instructions in the thumb2 size
> reduction pass (there may be other, older attempts that I don’t
> remember; maybe handling S regs?). Being target specific, the partial CPSR
> fix just hard codes a couple of high latency opcodes. It should be easy to
> use the generic TargetSchedModel API instead.
>
> MachineTraceMetrics goes beyond latency and gives you critical path/slack
> metrics. It can be useful when the tradeoffs are complicated. It might be
> overkill for a simple heuristic.
>

If there is anything to do here, I think this would both be overkill and
wouldn't work. Most of these techniques would conclude that nothing needs
to be done here because we write 8 bits and then read 8 bits, never reading
the merged register. The weird thing is that the chip doesn't seem to catch
this.