[PATCH] D66801: [X86][BtVer2] Fix latency and throughput of conditional SIMD store instructions.

Tue Aug 27 06:48:14 PDT 2019

andreadb created this revision.
andreadb added reviewers: RKSimon, craig.topper.
Herald added a subscriber: gbedwell.

On BtVer2, conditional SIMD stores are heavily microcoded.
The latency is directly proportional to the number of packed elements extracted from the input vector. Also, according to micro-benchmarks, most of the computation seems to be done in the integer unit.

Only a minority of the uOPs are executed by the FPU. The observed behaviour on the FPU looks similar to this:

- The input MASK value is moved to the Integer Unit -- [a VMOVMSK-like uOP, executed on JFPU0].
- In parallel, each element of the input XMM/YMM is extracted and then sent to the Integer Unit through JFPU1.

As expected, a (conditional) store is executed for every extracted element. Interestingly, a (speculative) load is executed for every extracted element too. It is as if a "LOAD - BIT_EXTRACT - CMOV" sequence of uOPs is repeated by the integer unit for every conditionally stored element.
VMASKMOVDQU is a special case: the number of speculative loads is always 2 (presumably, one load per quadword). That means extra shifts and masking are performed on (one of) the loaded quadwords before each conditional store (which also explains the large number of non-FP uOPs retired).
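
For illustration only, the per-element sequence described above can be sketched as a small functional model in C++. This is not code from the patch and it makes no claim about the actual microcode: the names `maskedStoreModel` and `MaskedStoreStats` are hypothetical, the element width is fixed at 32 bits, and the way the CMOV result feeds the store is an assumption. The sketch only shows why the number of loads and stores grows linearly with the number of packed elements.

```cpp
#include <array>
#include <cstddef>
#include <cstdint>

// Hypothetical counters for the modelled memory operations.
struct MaskedStoreStats {
  unsigned loads = 0;
  unsigned stores = 0;
};

// Functional model (an assumption, not the real microcode): a conditional
// store of N 32-bit elements as a repeated LOAD - BIT_EXTRACT - CMOV - store.
template <std::size_t N>
MaskedStoreStats maskedStoreModel(std::array<std::uint32_t, N> &mem,
                                  const std::array<std::uint32_t, N> &src,
                                  const std::array<std::uint32_t, N> &mask) {
  static_assert(N <= 32, "the packed mask must fit in 32 bits");

  // VMOVMSK-like step: pack the sign bit of each mask element into an integer.
  std::uint32_t bits = 0;
  for (std::size_t i = 0; i < N; ++i)
    bits |= (mask[i] >> 31) << i;

  MaskedStoreStats stats;
  for (std::size_t i = 0; i < N; ++i) {
    const std::uint32_t old = mem[i];                  // LOAD (speculative)
    ++stats.loads;
    const bool selected = ((bits >> i) & 1u) != 0;     // BIT_EXTRACT
    const std::uint32_t value = selected ? src[i] : old; // CMOV
    mem[i] = value;                                    // store (modelled as unconditional write-back)
    ++stats.stores;
  }
  return stats;
}
```

In this model, a 4-element store performs 4 loads and 4 stores and an 8-element store performs 8 of each, regardless of how many mask bits are set. The VMASKMOVDQU special case (two quadword loads followed by shifts and masking) is not modelled here.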

https://reviews.llvm.org/D66801

Files:
  lib/Target/X86/X86ScheduleBtVer2.td
  test/tools/llvm-mca/X86/BtVer2/resources-avx1.s
  test/tools/llvm-mca/X86/BtVer2/resources-sse2.s
