[libc-commits] [flang] [mlir] [llvm] [libc] [lld] [clang] [openmp] [compiler-rt] [polly] [libcxx] [clang-tools-extra] [CostModel][X86] Fix fpext conversion cost for 16 elements (PR #76278)
via libc-commits
libc-commits at lists.llvm.org
Thu Jan 4 18:52:08 PST 2024
HaohaiWen wrote:
There's cross iteration true dependency in previous experiment.
```
vcvtps2pd zmm2, ymm0
vextractf64x4 ymm0, zmm0, 1
vcvtps2pd zmm1, ymm0
```
The second cvt and first cvt of the next iteration need to wait for finish of vextract64x4. Therefore its cost is 5.
In real scenario, value of zmm0 should be reset to fpext new input.
```
vmovaps zmm0, zmm3
vcvtps2pd zmm2, ymm0
vextractf64x4 ymm0, zmm0, 1
vcvtps2pd zmm1, ymm0
```
This breaks the dependency and now cost is 3.
```
# ./nanoBench.sh -init "xor zmm0, zmm0" -asm "vmovaps zmm0, zmm3; vcvtps2pd zmm2, ymm0; vextractf64x4 ymm0, zmm0, 1; vcvtps2pd zmm1, ymm0" -config configs/cfg_SkylakeX_common.txt -unroll 1000 -loop 1000 -warm_up_count 10 -cpu 0
Note: Hyper-threading is enabled; it can be disabled with "sudo ./disable-HT.sh"
CORE_CYCLES: 3.00
INST_RETIRED: 4.00
IDQ.MITE_UOPS: 6.46
IDQ.DSB_UOPS: -0.45
IDQ.MS_UOPS: 0.01
LSD.UOPS: 0.00
UOPS_ISSUED: 6.01
UOPS_EXECUTED: 5.01
UOPS_RETIRED.RETIRE_SLOTS: 6.01
UOPS_DISPATCHED_PORT.PORT_0: 2.00
UOPS_DISPATCHED_PORT.PORT_1: 0.00
UOPS_DISPATCHED_PORT.PORT_2: 0.00
UOPS_DISPATCHED_PORT.PORT_3: 0.00
UOPS_DISPATCHED_PORT.PORT_4: 0.00
UOPS_DISPATCHED_PORT.PORT_5: 3.00
UOPS_DISPATCHED_PORT.PORT_6: 0.01
UOPS_DISPATCHED_PORT.PORT_7: 0.00
```
https://github.com/llvm/llvm-project/pull/76278
More information about the libc-commits
mailing list