[libcxx-commits] [flang] [llvm] [libcxx] [clang-tools-extra] [compiler-rt] [clang] [polly] [mlir] [lld] [CostModel][X86] Fix fpext conversion cost for 16 elements (PR #76278)
via libcxx-commits
libcxx-commits at lists.llvm.org
Thu Dec 28 18:11:33 PST 2023
HaohaiWen wrote:
> I meant - llvm-mca currently says the throughput for skylake etc. is 3cy not 5cy - so do you know why the intel scheduler models are underestimating the throughput?
The SKX schedule model reports the correct latency/uops/throughput for each instruction.
vcvtps2pd: https://uops.info/html-instr/VCVTPS2PD_ZMM_YMM.html#SKX
```
Instruction Lat TP Uops Ports
VCVTPS2PD (ZMM, YMM) AVX512EVEX 7 1.00 / 1.09 2 / 2 1*p05+1*p5
```
vextractf64x4: https://uops.info/html-instr/VEXTRACTF64X4_YMM_ZMM_I8.html#SKX
```
Instruction Lat TP Uops Ports
VEXTRACTF64X4 (YMM, ZMM, I8) AVX512EVEX 3 1.00 / 1.00 1 / 1 1*p5
```
In total there are 5 uops: 3 for p5 and 2 for p05. I guess mca assumes those 3\*p5 and 2\*p05 uops can run in parallel.
From the nanoBench result we can see that the 2\*p05 uops indeed went to p0. It looks like there are some dependencies, so they can't run perfectly in parallel. I don't know how uiCA analyzed it.
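To make the port arithmetic concrete, here is a rough sketch, not taken from llvm-mca or uiCA. It assumes the 16-element fpext lowers to one vextractf64x4 plus two vcvtps2pd, which is what the 5-uop count suggests, and the helper name `port_bound` is made up for illustration. It computes the ideal port-pressure bound (every uop lands on its best port, no dependencies) and, for contrast, the plain uop count as a fully serialized upper bound.

```python
from itertools import combinations

def port_bound(uops):
    """Ideal cycles per iteration if uops are spread perfectly over ports.

    uops: list of (count, ports) pairs, where ports is the set of ports
    a uop may issue on. Dependencies between uops are ignored.
    """
    ports = sorted(set().union(*(p for _, p in uops)))
    best = 0.0
    for k in range(1, len(ports) + 1):
        for subset in combinations(ports, k):
            s = set(subset)
            # Uops that can only run inside this subset must share its k ports.
            n = sum(c for c, p in uops if p <= s)
            best = max(best, n / k)
    return best

# Assumed sequence: 1x vextractf64x4 (1*p5) + 2x vcvtps2pd (1*p05 + 1*p5 each).
seq = [(1, {"p5"}), (2, {"p0", "p5"}), (2, {"p5"})]

ideal = port_bound(seq)            # p5 is the bottleneck with 3 uops
serial = sum(c for c, _ in seq)    # one uop per cycle, nothing overlaps
print(ideal, serial)
```

Under these assumptions the ideal bound is 3 cycles, which matches the 3cy figure quoted above, while the serialized count is 5. The second number is only the uop total; it does not by itself explain the hardware measurement.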
https://github.com/llvm/llvm-project/pull/76278