[flang] [llvm] [libcxx] [clang-tools-extra] [compiler-rt] [clang] [polly] [mlir] [lld] [CostModel][X86] Fix fpext conversion cost for 16 elements (PR #76278)

Thu Dec 28 18:11:33 PST 2023

HaohaiWen wrote:

> I meant - llvm-mca currently says the throughput for skylake etc. is 3cy not 5cy - so do you know why the intel scheduler models are underestimating the throughput?

The SKX scheduling model reports the correct latency/uops/throughput for each individual instruction.
vcvtps2pd: https://uops.info/html-instr/VCVTPS2PD_ZMM_YMM.html#SKX
```
Instruction                   Extension   Lat  TP           Uops   Ports
VCVTPS2PD (ZMM, YMM)          AVX512EVEX  7    1.00 / 1.09  2 / 2  1*p05+1*p5
```
vextractf64x4: https://uops.info/html-instr/VEXTRACTF64X4_YMM_ZMM_I8.html#SKX
```
Instruction                   Extension   Lat  TP           Uops   Ports
VEXTRACTF64X4 (YMM, ZMM, I8)  AVX512EVEX  3    1.00 / 1.00  1 / 1  1*p5
```

There are 5 uops in total: 3 for p5 and 2 for p05. I guess mca assumed that the 3\*p5 and 2\*p05 uops can run in parallel.
From the nanoBench results we can see that the 2\*p05 uops did indeed go to p0. It looks like there are some dependencies, so they can't run in parallel as the ideal case would suggest. I don't know how uiCA analyzed it.
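
To illustrate the port arithmetic, here is a minimal sketch of an idealized port-pressure bound. It is not how llvm-mca is implemented; `port_bound` is a hypothetical helper, and the 5-uop sequence (two VCVTPS2PD plus one VEXTRACTF64X4) is an assumption inferred from the uop counts above. The bound assumes the uops are fully independent and perfectly balanced across ports.

```python
from itertools import combinations


def port_bound(uops):
    """Idealized cycles per iteration from port pressure alone.

    uops: list of frozensets, each holding the ports one uop may issue on.
    For every subset of ports, the uops that can only use ports in that
    subset need at least count / len(subset) cycles; the bound is the max.
    """
    ports = sorted(set().union(*uops))
    best = 0.0
    for r in range(1, len(ports) + 1):
        for subset in combinations(ports, r):
            allowed = set(subset)
            confined = sum(1 for u in uops if u <= allowed)
            best = max(best, confined / len(allowed))
    return best


P5 = frozenset({"p5"})
P05 = frozenset({"p0", "p5"})

# Assumed sequence: 2x VCVTPS2PD (1*p05 + 1*p5 each) + 1x VEXTRACTF64X4 (1*p5)
sequence = [P05, P5, P5, P05, P5]

bound = port_bound(sequence)
print(bound)  # 3.0: p5 alone must take the three p5-only uops
```

Under these assumptions the bound is 3 cycles, the same figure llvm-mca reports in the quoted question. Any dependency that keeps the p05 uops from overlapping with the p5 uops would push the real throughput above that.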

https://github.com/llvm/llvm-project/pull/76278