[all-commits] [llvm/llvm-project] da965a: [X86][SLM] Fix MUL uops, latency and throughput

Sat Sep 4 05:22:02 PDT 2021

  Branch: refs/heads/main
  Home:   https://github.com/llvm/llvm-project
  Commit: da965a77d566b9295a5928ca4c989650131bfc0b
      https://github.com/llvm/llvm-project/commit/da965a77d566b9295a5928ca4c989650131bfc0b
  Author: Simon Pilgrim <llvm-dev at redking.me.uk>
  Date:   2021-09-04 (Sat, 04 Sep 2021)

  Changed paths:
    M llvm/lib/Target/X86/X86ScheduleSLM.td
    M llvm/test/tools/llvm-mca/X86/SLM/resources-x86_64.s

  Log Message:
  -----------
  [X86][SLM] Fix MUL uops, latency and throughput

These were all set to the same best-case MUL i32 values (i32 seems to be the only variant of MUL that SLM actually performs well with).

Noticed while trying to improve multiplication costs for vectorization via the D103695 helper script. Confirmed against Intel AOM / Agner / InstLatX64.

  Commit: c6371020a801f1da327ec3dcdfa0818fbd6f657a
      https://github.com/llvm/llvm-project/commit/c6371020a801f1da327ec3dcdfa0818fbd6f657a
  Author: Simon Pilgrim <llvm-dev at redking.me.uk>
  Date:   2021-09-04 (Sat, 04 Sep 2021)

  Changed paths:
    M llvm/lib/Target/X86/X86ScheduleSLM.td
    M llvm/test/tools/llvm-mca/X86/SLM/resources-x86_64.s

  Log Message:
  -----------
  [X86][SLM] RMW instructions don't require an extra uop

For RMW instructions, the load and store hold the MEC for an extra cycle, but within the same single uop. This is alluded to in the Intel AOM:

"The MEC also owns the MEC RSV, which is responsible for scheduling of all loads and stores. Load and
store instructions go through addresses generation phase in program order to avoid on-the-fly memory
ordering later in the pipeline. Therefore, an unknown address will stall younger memory instructions."

Noticed while trying to get a cheap SLM test box up and running with llvm-exegesis: RMW arithmetic is always 1 uop, which matches what Agner / InstLatX64 report as well.
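
As a rough illustration of the change described above, here is a toy Python model of a scheduling entry. It is only a sketch, not the TableGen in X86ScheduleSLM.td: the resource labels "IEC" and "MEC", the helper names fold_load and make_rmw, and the assumption that a folded load occupies the MEC for one cycle are all hypothetical. The point it shows is that the RMW form keeps the same uop count and only adds MEC occupancy.

```python
def fold_load(reg_form):
    """Toy load-folded variant: same uops, MEC busy for one cycle (assumed)."""
    cycles = dict(reg_form["cycles"])  # copy so the input entry is untouched
    cycles["MEC"] = cycles.get("MEC", 0) + 1
    return {"uops": reg_form["uops"], "cycles": cycles}


def make_rmw(load_form):
    """Toy RMW variant: MEC held one extra cycle, no extra uop."""
    cycles = dict(load_form["cycles"])
    cycles["MEC"] = cycles.get("MEC", 0) + 1
    return {"uops": load_form["uops"], "cycles": cycles}


# Placeholder register-form entry, not taken from the SLM model.
reg = {"uops": 1, "cycles": {"IEC": 1}}
load = fold_load(reg)
rmw = make_rmw(load)

print("reg :", reg)
print("load:", load)
print("rmw :", rmw)
```

In this sketch the uop count stays at 1 across all three forms, while the MEC entry grows by one cycle at each step.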

  Commit: 994da657076900f5ad7fe593c3b5e5f89ab3d53d
      https://github.com/llvm/llvm-project/commit/994da657076900f5ad7fe593c3b5e5f89ab3d53d
  Author: Simon Pilgrim <llvm-dev at redking.me.uk>
  Date:   2021-09-04 (Sat, 04 Sep 2021)

  Changed paths:
    M llvm/lib/Target/X86/X86ScheduleSLM.td
    M llvm/test/tools/llvm-mca/X86/SLM/resources-sse2.s
    M llvm/test/tools/llvm-mca/X86/SLM/resources-sse41.s
    M llvm/test/tools/llvm-mca/X86/SLM/resources-ssse3.s

  Log Message:
  -----------
  [X86][SLM] WriteVecIMul instructions only take 1uop

The xmm variants have half the throughput (and +1cy latency) of the mmx variants, but are still 1uop.

I still need to do more thorough testing of SLM on test-suite before fixing the obvious bad numbers for WritePMULLD.

But this helps the D103695 helper script get more accurate numbers for vXi32 multiplies of extended operands (i.e. we can use PMADDWD, PMULLW/PMULHW etc.). Matches what Intel AOM / Agner / llvm-exegesis report.
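
The xmm/mmx relationship stated in this log message can be written out as a small worked example. This is a sketch under stated assumptions: the SchedNumbers type and xmm_from_mmx helper are hypothetical names, and the input numbers are placeholders rather than measured SLM data. "Half the throughput" is expressed as a doubled reciprocal throughput (cycles per instruction).

```python
from dataclasses import dataclass
from fractions import Fraction


@dataclass(frozen=True)
class SchedNumbers:
    uops: int
    latency: int           # cycles
    rthroughput: Fraction  # reciprocal throughput, cycles per instruction


def xmm_from_mmx(mmx):
    """Derive the xmm numbers from the mmx ones per the rule in the log."""
    return SchedNumbers(
        uops=mmx.uops,                    # still a single uop
        latency=mmx.latency + 1,          # +1cy latency
        rthroughput=mmx.rthroughput * 2,  # half the throughput
    )


# Placeholder mmx entry for illustration only.
mmx = SchedNumbers(uops=1, latency=4, rthroughput=Fraction(1))
xmm = xmm_from_mmx(mmx)
print(mmx)
print(xmm)
```

With the placeholder input above, the derived xmm entry has the same uop count, one more cycle of latency, and twice the reciprocal throughput.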

Compare: https://github.com/llvm/llvm-project/compare/fd52b4357a6e...994da6570769