[PATCH] D14971: X86: Emit smaller code for moving 8-bit immediates

Sean Silva via llvm-commits llvm-commits at lists.llvm.org
Mon Nov 30 21:46:32 PST 2015


silvas added a comment.

In http://reviews.llvm.org/D14971#299120, @hans wrote:

> Thanks for looking into this! How does the "or -1" approach compare in your benchmark?


It is the same execution cost as the MOV version so the result will be the same.

I confirmed with the benchmark: http://reviews.llvm.org/P536
(this also includes two more variants that stress the ALU in different ways; on Jaguar at least, the throughput for "fast" ALU ops is high enough compared to the decode and retire throughput that neither variant actually bottlenecks on the ALU; at most, the bottleneck is a "tie" between the ALU and retirement)
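For context, a sketch of the size difference under discussion, with the byte encodings of the 32-bit register forms (as I recall them from the Intel SDM; %eax chosen just for illustration):

  b8 ff ff ff ff      mov $-1, %eax    # 5 bytes: full imm32
  83 c8 ff            or  $-1, %eax    # 3 bytes: sign-extended imm8

So the "or -1" form saves 2 bytes per materialization, at the cost of turning a write-only instruction into one that also reads its destination register.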

This is where we hit an issue with micro-benchmarking. In this microbenchmark, the only difference between the "mov -1" and the "or -1" versions is the loop-carried dependency, which isn't an issue here because the cycle it forms in the dependence graph is just a single instruction (a lone single-cycle OR).
The cost of the false dependency could become higher if the register used by the "or -1" version was the destination of a previous high-latency operation, thus preventing the processor from getting work done "in the shadow" of the high-latency operation. E.g. a code sequence like:

  void foo(int x) {
    if (bar(x))
      return;
    .... do a bunch of work dependent on materializing -1 ....
  }
  bool bar(int x) {
    reg_t* Bucket = computeAddressOfBucket(x);
    reg_t Reg = *Bucket; // Load from a random-ish address, say, not in cache.
    if (Reg != 0)
      return false;
    ... other work ...
    return true;
  }

If we branch-predict into the `return false` of `bar` (and through the `if (bar(x))`, which is easy), then we end up in ".... do a bunch of work dependent on materializing -1 ....". If -1 happens to be materialized into the same register as the high-latency `Reg` register in `bar`, then using "or -1" to materialize it is potentially very detrimental to performance.

If only the processor knew that we didn't actually care about the previous value of the register when materializing -1, we could execute dozens of instructions (limited by the processor's retirement buffer: 64 COPs on Jaguar, around 200 micro-ops on the latest big Intel cores) while waiting for the load from `Bucket`. If the branch predictions are right, those dozens of instructions that we managed to get through while waiting for the load are effectively already "done"* once the data arrives from memory. Note that in this situation (correct speculation), everything executed "in the shadow" of the high-latency operation is actually on the critical path (it would have had to be executed anyway), so this speculative execution potentially chops about min(latency of the high-latency operation, time to fill the reorder buffer) off the critical path, which can be huge.
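To make the aliasing hazard concrete, here is a sketch of the speculated path (register choice hypothetical; assume %eax is the register `Reg` was loaded into in `bar`):

  mov  (%rdi), %eax    # high-latency load in bar (say, a cache miss)
  ...                  # predicted branches land us in foo's "-1" work
  or   $-1, %eax       # reads %eax: cannot issue until the load completes,
                       # even though the result (-1) doesn't depend on it
  mov  $-1, %eax       # alternative: writes %eax without reading it, so it
                       # and its dependents can issue in the load's shadow

The OR's false input dependency serializes everything downstream of it behind the miss; the MOV (or an `xor`-style zeroing idiom, which decoders recognize as dependency-breaking) does not.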

To be honest, I don't have measurements on real-world code of how frequently this kind of unfortunate register aliasing happens or how detrimental it is, but it is probably worth avoiding (especially since we have multiple alternatives). Off the top of my head, the only way I can think of to measure it indirectly would be to hack the compiler to insert a bunch of "xor reg, reg" instructions clearing any live-outs of functions that came from memory ops, and see the performance effect. We would probably also need a control version that generates the same number of "xor reg, reg" instructions but against different registers (e.g. all against one register that isn't among the ones we actually care about clearing), to control for the extra decode/icache cost.
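A sketch of what the hacked output might look like at a call site (register names hypothetical; `xor` self-zeroing is a recognized dependency-breaking idiom on x86):

  call bar             # suppose %eax is a live-out that came from a load
  xor  %eax, %eax      # experiment: clear it, breaking any dependency on
                       # the (possibly still in-flight) loaded value
  xor  %r15d, %r15d    # control version instead emits this: same number of
                       # instructions to decode, but it doesn't touch the
                       # registers we actually care about

The performance delta between the two versions would bound how much these false dependencies cost in practice, separate from the code-size overhead of the inserted instructions.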

*(except for the retirement cost: 2 COPs per cycle on Jaguar; Sandy Bridge / Haswell do 4 fused micro-ops per cycle. Retirement happens in parallel with the rest of the core's operation, though, so execution is able to immediately make forward progress)


http://reviews.llvm.org/D14971
