[all-commits] [llvm/llvm-project] da965a: [X86][SLM] Fix MUL uops, latency and throughput
Simon Pilgrim via All-commits
all-commits at lists.llvm.org
Sat Sep 4 05:22:02 PDT 2021
Branch: refs/heads/main
Home: https://github.com/llvm/llvm-project
Commit: da965a77d566b9295a5928ca4c989650131bfc0b
https://github.com/llvm/llvm-project/commit/da965a77d566b9295a5928ca4c989650131bfc0b
Author: Simon Pilgrim <llvm-dev at redking.me.uk>
Date: 2021-09-04 (Sat, 04 Sep 2021)
Changed paths:
M llvm/lib/Target/X86/X86ScheduleSLM.td
M llvm/test/tools/llvm-mca/X86/SLM/resources-x86_64.s
Log Message:
-----------
[X86][SLM] Fix MUL uops, latency and throughput
These were all set to the same best-case mul i32 values (mul i32 seems to be the only version of MUL that SLM actually performs well with).
Noticed while trying to improve multiplication costs for vectorization via the D103695 helper script. Confirmed with Intel AOM / Agner / InstLatX64.
Commit: c6371020a801f1da327ec3dcdfa0818fbd6f657a
https://github.com/llvm/llvm-project/commit/c6371020a801f1da327ec3dcdfa0818fbd6f657a
Author: Simon Pilgrim <llvm-dev at redking.me.uk>
Date: 2021-09-04 (Sat, 04 Sep 2021)
Changed paths:
M llvm/lib/Target/X86/X86ScheduleSLM.td
M llvm/test/tools/llvm-mca/X86/SLM/resources-x86_64.s
Log Message:
-----------
[X86][SLM] RMW instructions don't require an extra uop
For RMW instructions, the load and store hold the MEC (memory execution cluster) for an extra cycle, but within the same single uop. This is alluded to in the Intel AOM:
"The MEC also owns the MEC RSV, which is responsible for scheduling of all loads and stores. Load and
store instructions go through addresses generation phase in program order to avoid on-the-fly memory
ordering later in the pipeline. Therefore, an unknown address will stall younger memory instructions."
Noticed while trying to get a cheap SLM test box up and running with llvm-exegesis: RMW arithmetic is always 1uop, which matches what Agner / InstLatX64 report as well.
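The distinction between uop count and resource occupancy can be sketched with a toy model. This is not LLVM's scheduler code: the `SchedWrite` class, the resource name `ALU`, and every cycle count below are hypothetical placeholders chosen for illustration, not values taken from X86ScheduleSLM.td.

```python
from dataclasses import dataclass, field


@dataclass
class SchedWrite:
    """Toy stand-in for a scheduling-model entry (hypothetical, not LLVM's API)."""
    uops: int
    # Cycles each pipeline resource is held, keyed by resource name.
    resource_cycles: dict = field(default_factory=dict)

    def reciprocal_throughput(self) -> int:
        # Assuming one unit per resource, the busiest resource bounds the issue rate.
        return max(self.resource_cycles.values())


# Placeholder numbers, chosen only to illustrate the idea.
reg_op = SchedWrite(uops=1, resource_cycles={"ALU": 1})

# RMW modelled with an additional uop.
rmw_extra_uop = SchedWrite(uops=2, resource_cycles={"ALU": 1, "MEC": 2})

# RMW modelled as the same single uop that holds the MEC one cycle longer.
rmw_single_uop = SchedWrite(uops=1, resource_cycles={"ALU": 1, "MEC": 2})

for name, write in [
    ("reg", reg_op),
    ("rmw, extra uop", rmw_extra_uop),
    ("rmw, single uop", rmw_single_uop),
]:
    print(f"{name}: uops={write.uops} rthroughput={write.reciprocal_throughput()}")
```

In this sketch the two RMW entries occupy the MEC for the same number of cycles and differ only in uop count, which is the quantity the commit corrects.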
Commit: 994da657076900f5ad7fe593c3b5e5f89ab3d53d
https://github.com/llvm/llvm-project/commit/994da657076900f5ad7fe593c3b5e5f89ab3d53d
Author: Simon Pilgrim <llvm-dev at redking.me.uk>
Date: 2021-09-04 (Sat, 04 Sep 2021)
Changed paths:
M llvm/lib/Target/X86/X86ScheduleSLM.td
M llvm/test/tools/llvm-mca/X86/SLM/resources-sse2.s
M llvm/test/tools/llvm-mca/X86/SLM/resources-sse41.s
M llvm/test/tools/llvm-mca/X86/SLM/resources-ssse3.s
Log Message:
-----------
[X86][SLM] WriteVecIMul instructions only take 1uop
The xmm variants have half the throughput (and +1cy latency) of the mmx variants, but are still 1uop.
I still need to do more thorough testing of SLM on test-suite before fixing the obviously bad numbers for WritePMULLD.
But this helps the D103695 helper script get to more accurate numbers for vXi32 multiplies of extended operands (i.e. we can use PMADDWD, PMULLW/PMULHW etc.). This matches what Intel AOM / Agner / llvm-exegesis report.
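As a rough illustration of the mmx/xmm relationship stated in this log message, the sketch below derives an xmm timing from an mmx one. The `Timing` class, the `xmm_from_mmx` helper, and the mmx input numbers are all hypothetical placeholders, not values read from the scheduler model.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Timing:
    latency: int      # cycles
    rthroughput: int  # reciprocal throughput, cycles per instruction
    uops: int


def xmm_from_mmx(mmx: Timing) -> Timing:
    """Apply the stated relationship: +1 cycle latency, half the throughput, same uop count."""
    return Timing(
        latency=mmx.latency + 1,
        rthroughput=mmx.rthroughput * 2,  # half the throughput = double the reciprocal
        uops=mmx.uops,
    )


# Placeholder mmx numbers, for illustration only.
mmx = Timing(latency=4, rthroughput=1, uops=1)
xmm = xmm_from_mmx(mmx)
print(xmm)
```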
Compare: https://github.com/llvm/llvm-project/compare/fd52b4357a6e...994da6570769