[llvm-bugs] [Bug 32863] New: feature rq: track "cold" vector registers for use as don't-care sources to avoid false dependencies

Mon May 1 01:36:27 PDT 2017

https://bugs.llvm.org/show_bug.cgi?id=32863

            Bug ID: 32863
           Summary: feature rq: track "cold" vector registers for use as
                    don't-care sources to avoid false dependencies
           Product: new-bugs
           Version: trunk
          Hardware: PC
                OS: All
            Status: NEW
          Severity: enhancement
          Priority: P
         Component: new bugs
          Assignee: unassignedbugs at nondot.org
          Reporter: peter at cordes.ca
                CC: llvm-bugs at lists.llvm.org

(This is a summary / rewrite of what I wrote while having this idea on an old
closed bug: https://bugs.llvm.org/show_bug.cgi?id=22024#c11.  See that for some
Haswell perf analysis of scalar int->FP conversion.)

x86 has several cases of inconvenient input-dependencies, either for scalar
stuff in vector regs or for stuff like generating a vector of all-ones on CPUs
that don't recognize PCMPEQD same,same as independent of its inputs.  The usual
solution is to break dependencies with pxor same,same before doing something,
or to guess / hope that a register unused by this function is safe to use.

But with AVX 3-operand instructions, we can use a different strategy:  reuse a
known-safe register as the don't-care input without destroying it.

Such a register doesn't have to have been xor-zeroed; it can be holding a
loop-invariant constant.  Or we can vpxor one such register and reuse it for
the rest of the function (or until we make a function call, which could return
with OOO execution still chewing through a long dep chain on that register).

The use cases where having a safe read-only register helps include:

 * vcvtsi2ss/sd %r64,%merge_into, %xmm destination  # badly-designed instruction
 * vsqrtss     (mem),%merge_into, %xmm
 * vpcmpeqd    %same,%same, %dest    # false dep on KNL / Silvermont
 * vcmptrueps  %same,%same, %ymm     # splat -1 without AVX2.  false dep on all known uarches
 * Maybe some weird shuffle use-cases?

The most important / common one by far is int->float conversion, due to Intel's
short-sighted design of SSE, and decision to keep that behaviour in the AVX
versions.  (good for consistency, bad for performance).  Anyway, hoisting a
VXORPS out of a loop that includes a vcvtsi2sd is an obvious win.
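
A minimal sketch of the kind of loop in question (a hypothetical example, not
the godbolt test-case referenced in this report; the name sum_as_double is made
up).  Every iteration needs a scalar int->FP conversion, and on x86-64 without
AVX-512 compilers typically emit cvtsi2sd / vcvtsi2sd for the cast, so whether
the merge register is zeroed inside or outside the loop is easy to see in the
generated asm.

```c
#include <stddef.h>
#include <stdint.h>

/* Sum an int64_t array as double.  The cast below is the scalar
   int->FP conversion whose merge-destination dependency is at issue. */
double sum_as_double(const int64_t *a, size_t n)
{
    double total = 0.0;
    for (size_t i = 0; i < n; i++)
        total += (double)a[i];   /* one conversion per iteration */
    return total;
}
```

Compiling this with optimization and AVX enabled and looking for a vxorps
inside the loop body is one way to check which strategy a given compiler
picked.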

clang already sort of does this for int->float conversions: AFAICT it picks a
register unused in the function, and gambles that it is cold.  This is a
reasonable strategy, but it falls apart under register pressure.  (And more
sophisticated tracking can also avoid gambling that a caller or callee didn't
leave a register at the end of a long dep chain independent from the int->float
conversion we're doing.  e.g. near the end of a dep chain that includes a
cache-miss or a loop accumulator.)  Although perhaps this gamble is still worth
the code-size savings from leaving out a lot of vxorps instructions.

If you use up all the xmm regs with constants, then clang will put a
vxorps-zeroing instruction into the loop and then replace it with a constant. 
Better would be to simply use one of the constants as the merge-dest for
vcvtsi2sd.  The x86-64 SysV ABI (and I assume other ABIs) allows passing scalar
float/double args with garbage (not zeros) in the high bytes, and this is
already something that happens in practice, so there's no reason to worry about
not "cleaning up" the results of this before making function calls.

See this test-case on godbolt for clang trunk 301740, gcc8 20170429, icc17, and
MSVC CL19 2017.  With 16 constants needed, clang keeps two of them in memory
since it uses two scratch regs for no reason/benefit.  (And has a vxorps in the
loop).

See also https://gcc.gnu.org/bugzilla/show_bug.cgi?id=80571 which I reported
just about the int->float part of this issue.

-----

Related cases: non-read-only, with a false dep on the output.  Tracking which
registers have not been modified recently lets us pick one of them to clobber,
and/or decide whether to use a vpxor-zeroing instruction depending on how long
ago the register we picked was last modified.  e.g. after one loop, before
another loop, all the dead constant registers from the finished loop are safe
to read.

* all of the above without AVX, where dst=src2.

* vpternlogd  $0xff,any,any, src3/dst  # zmm splat -1: false dep on the dst
which is also a 3rd source reg.   All 3 vectors are inputs, so we need a stale
reg we can clobber (or a vpxor dep-breaker).  Hardware could avoid this by
treating imm8=0xFF as a special case, but neither KNL nor skylake-avx512 does. 
(I checked skylake-avx512 on a google-compute-engine VM: definitely a false
dep: it runs about twice as fast when adding a vpxor to the loop.  Appears to
be something like 1c latency, one per 0.5c throughput).

vsqrtss with a register (not memory) source can use  src,src,dest with AVX,
avoiding the false dependency that src,dest,dest has.  (clang4.0 gets this
right; 3.9.1 and earlier are like gcc and use vsqrtss %xmm1,%xmm0,%xmm0.  ICC
uses vsqrtss %xmm1,%xmm1,%xmm1 and then vmovaps.)  int->float conversion with
vcvtsi2ss can't use this trick because the source operands aren't both vector
regs.
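
The not-recently-modified tracking described at the top of this section could
look roughly like the sketch below.  Everything in it is hypothetical: the
names (cold_tracker, tracker_note_write, tracker_note_call,
tracker_pick_coldest) are invented for illustration and are not LLVM APIs, and
measuring "age" as a plain instruction count is a crude stand-in for a real
latency model.  A call is treated as a write to every register, since the
callee could return with a long dep chain still in flight on any of them.

```c
#include <stdbool.h>

enum { NUM_VREGS = 16 };   /* xmm0..xmm15 in 64-bit mode */

typedef struct {
    int last_write[NUM_VREGS];   /* instruction index of the latest write */
} cold_tracker;

/* At function entry nothing is known, so every register counts as
   written at instruction 0. */
void tracker_init(cold_tracker *t)
{
    for (int r = 0; r < NUM_VREGS; r++)
        t->last_write[r] = 0;
}

/* Record that instruction number insn wrote register reg. */
void tracker_note_write(cold_tracker *t, int reg, int insn)
{
    t->last_write[reg] = insn;
}

/* A call may leave any register at the end of a long dep chain. */
void tracker_note_call(cold_tracker *t, int insn)
{
    for (int r = 0; r < NUM_VREGS; r++)
        t->last_write[r] = insn;
}

/* Return the register with the oldest write (lowest number wins ties).
   *needs_break is set when even that register was written fewer than
   min_age instructions ago, i.e. a vpxor dep-breaker looks worthwhile. */
int tracker_pick_coldest(const cold_tracker *t, int now, int min_age,
                         bool *needs_break)
{
    int best = 0;
    for (int r = 1; r < NUM_VREGS; r++)
        if (t->last_write[r] < t->last_write[best])
            best = r;
    *needs_break = (now - t->last_write[best]) < min_age;
    return best;
}
```

A real implementation would also have to know which registers are dead (for
the clobbering cases) versus merely safe to read (for the AVX read-only cases).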

---

If we had such a readiness/coldness/dep-chain tracking infrastructure,
_mm_undefined_ps() could take advantage of it to make a good choice for which
dead register to pick.  (And whether to dep-break it.)  This is useful for
things like a horizontal-sum function that wants to use MOVHLPS to avoid extra
MOVAPS instructions when extracting the high half of a vector with only SSE2. 
(With SSE3, MOVSHDUP is a great first-step as an FP copy+shuffle.  Then you can
use the original __m128 C variable as a destination for MOVHLPS, since it's
from earlier in the same dep chain but dead now.)
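
A sketch of the SSE1/SSE2 horizontal sum this has in mind, assuming the
compiler provides _mm_undefined_ps() in <xmmintrin.h>.  The name hsum4 and the
scalar fallback for non-x86 targets are additions for the example.  MOVHLPS
only writes the low 64 bits of its destination, so the upper two elements of
hi are whatever was in the register the compiler picked; they get added but
are never read back.

```c
#if defined(__SSE__) || defined(_M_X64)
#include <xmmintrin.h>

/* Horizontal sum of 4 floats using only SSE1 instructions. */
float hsum4(const float *p)
{
    __m128 v    = _mm_loadu_ps(p);
    /* Low half of hi = high half of v; high half of hi is don't-care. */
    __m128 hi   = _mm_movehl_ps(_mm_undefined_ps(), v);
    __m128 sum2 = _mm_add_ps(v, hi);        /* [v0+v2, v1+v3, ?, ?] */
    /* Broadcast element 1 (v1+v3) so it can be added to element 0. */
    __m128 odd  = _mm_shuffle_ps(sum2, sum2, _MM_SHUFFLE(1, 1, 1, 1));
    __m128 sum1 = _mm_add_ss(sum2, odd);    /* element 0 = full sum */
    return _mm_cvtss_f32(sum1);
}
#else
/* Scalar fallback with the same association order, for non-x86 builds. */
float hsum4(const float *p)
{
    return (p[0] + p[2]) + (p[1] + p[3]);
}
#endif
```

Which register the compiler picks for the _mm_undefined_ps() operand, and
whether it zeroes it first, is exactly the choice that coldness tracking could
inform.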

---------

That reminds me: for instructions that do have a real source (like sqrtss), an
output dependency on a register that had to be ready earlier in the same dep
chain is always safe.

e.g. if we want to keep around a*b and sqrtf(a*b), we can do this without any
ill effects from sqrtss's dependency on its output:

  mulss    %xmm0, %xmm1    
  sqrtss   %xmm1, %xmm0    # xmm1 being ready means xmm0 is also ready
