[libc-commits] [flang] [mlir] [llvm] [libc] [lld] [clang] [openmp] [compiler-rt] [polly] [libcxx] [clang-tools-extra] [CostModel][X86] Fix fpext conversion cost for 16 elements (PR #76278)
    via libc-commits 
    libc-commits at lists.llvm.org
       
    Thu Jan  4 18:52:08 PST 2024
    
    
  
HaohaiWen wrote:
There's a cross-iteration true dependency in the previous experiment.
```
vcvtps2pd zmm2, ymm0
vextractf64x4 ymm0, zmm0, 1
vcvtps2pd zmm1, ymm0
```
The second cvt, and the first cvt of the next iteration, have to wait for vextractf64x4 to finish. Therefore the cost is 5.
In a real scenario, zmm0 would be reset to the new fpext input on each iteration.
```
vmovaps zmm0, zmm3
vcvtps2pd zmm2, ymm0
vextractf64x4 ymm0, zmm0, 1
vcvtps2pd zmm1, ymm0
```
This breaks the dependency, and the cost is now 3.
```
# ./nanoBench.sh -init "xor zmm0, zmm0" -asm "vmovaps zmm0, zmm3; vcvtps2pd zmm2, ymm0; vextractf64x4 ymm0, zmm0, 1; vcvtps2pd zmm1, ymm0" -config configs/cfg_SkylakeX_common.txt -unroll 1000 -loop 1000 -warm_up_count 10 -cpu 0
Note: Hyper-threading is enabled; it can be disabled with "sudo ./disable-HT.sh"
CORE_CYCLES: 3.00
INST_RETIRED: 4.00
IDQ.MITE_UOPS: 6.46
IDQ.DSB_UOPS: -0.45
IDQ.MS_UOPS: 0.01
LSD.UOPS: 0.00
UOPS_ISSUED: 6.01
UOPS_EXECUTED: 5.01
UOPS_RETIRED.RETIRE_SLOTS: 6.01
UOPS_DISPATCHED_PORT.PORT_0: 2.00
UOPS_DISPATCHED_PORT.PORT_1: 0.00
UOPS_DISPATCHED_PORT.PORT_2: 0.00
UOPS_DISPATCHED_PORT.PORT_3: 0.00
UOPS_DISPATCHED_PORT.PORT_4: 0.00
UOPS_DISPATCHED_PORT.PORT_5: 3.00
UOPS_DISPATCHED_PORT.PORT_6: 0.01
UOPS_DISPATCHED_PORT.PORT_7: 0.00
```
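As a rough cross-check of the counters above (a sketch, not part of the original measurement), the busiest execution port can be compared against CORE_CYCLES. This assumes each port dispatches at most one uop per cycle, so the per-iteration uop count on the busiest port is a lower bound on cycles per iteration. The helper names `parse_counters` and `busiest_port` are hypothetical.

```python
import re

# Counter lines copied from the nanoBench output above (per-iteration averages).
NANOBENCH_OUTPUT = """\
CORE_CYCLES: 3.00
INST_RETIRED: 4.00
UOPS_DISPATCHED_PORT.PORT_0: 2.00
UOPS_DISPATCHED_PORT.PORT_1: 0.00
UOPS_DISPATCHED_PORT.PORT_2: 0.00
UOPS_DISPATCHED_PORT.PORT_3: 0.00
UOPS_DISPATCHED_PORT.PORT_4: 0.00
UOPS_DISPATCHED_PORT.PORT_5: 3.00
UOPS_DISPATCHED_PORT.PORT_6: 0.01
UOPS_DISPATCHED_PORT.PORT_7: 0.00
"""

PORT_PREFIX = "UOPS_DISPATCHED_PORT.PORT_"


def parse_counters(text):
    """Parse 'NAME: value' lines into a dict of floats; skip anything else."""
    counters = {}
    for line in text.splitlines():
        m = re.match(r"^([A-Za-z0-9_.]+):\s*(-?\d+(?:\.\d+)?)$", line.strip())
        if m:
            counters[m.group(1)] = float(m.group(2))
    return counters


def busiest_port(counters):
    """Return (port id, uops per iteration) for the most heavily used port."""
    ports = {
        name[len(PORT_PREFIX):]: value
        for name, value in counters.items()
        if name.startswith(PORT_PREFIX)
    }
    port = max(ports, key=ports.get)
    return port, ports[port]


counters = parse_counters(NANOBENCH_OUTPUT)
port, uops = busiest_port(counters)
print(f"busiest port: {port} ({uops:.2f} uops/iteration), "
      f"CORE_CYCLES: {counters['CORE_CYCLES']:.2f}")
```

For this run, port 5 carries 3.00 uops per iteration, which matches the measured 3.00 cycles, so with the dependency broken the sequence appears to be limited by port 5 pressure.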
https://github.com/llvm/llvm-project/pull/76278
    
    