[PATCH] D25722: Improved cost model for FDIV and FSQRT
Andrew V. Tischenko via llvm-commits
llvm-commits at lists.llvm.org
Tue Oct 25 01:12:29 PDT 2016
avt77 added inline comments.
================
Comment at: lib/Target/X86/X86TargetTransformInfo.cpp:269
+ { ISD::FDIV, MVT::v4f32, 14 }, // IACA value for SandyBridge arch
+ { ISD::FDIV, MVT::v8f32, 41 }, // IACA value for SandyBridge arch
+ { ISD::FDIV, MVT::f64, 21 }, // IACA value for SandyBridge arch
----------------
mkuper wrote:
> A YMM fdiv being 3 times as expensive as a XMM fdiv seems slightly odd.
>
> I'd expect "2, maybe a bit more", and Agner seems to agree - e.g. for Sandybridge he gives a range of 10-14 cycles for XMM DIVPS, and 20-28 cycles for YMM VDIVPS. Were your IACA checks accounting for additional instructions? Or is this an inconsistency between IACA and Agner's tests?
>
> (Note that these numbers are supposed to represent reciprocal throughput, but Agner's data for latency also has a factor of ~2)
>
I use the following numbers:
atischenko at ip-172-31-21-62:~/iaca-lin64/bin$ ./test-arith-fdiv.sh
***************************************SandyBridge***************************************
Intel(R) Architecture Code Analyzer Version - 2.1
Analyzed File - arith-fdiv.o
Binary Format - 64Bit
Architecture - SNB
Analysis Type - Throughput
Throughput Analysis Report
--------------------------
Block Throughput: 14.00 Cycles Throughput Bottleneck: Divider
Port Binding In Cycles Per Iteration:
-------------------------------------------------------------------------
| Port | 0 - DV | 1 | 2 - D | 3 - D | 4 | 5 |
-------------------------------------------------------------------------
| Cycles | 1.0 14.0 | 0.0 | 0.0 0.0 | 0.0 0.0 | 0.0 | 0.0 |
-------------------------------------------------------------------------
N - port number or number of cycles resource conflict caused delay, DV - Divider pipe (on port 0)
D - Data fetch pipe (on ports 2 and 3), CP - on a critical path
F - Macro Fusion with the previous instruction occurred
* - instruction micro-ops not bound to a port
^ - Micro Fusion happened
# - ESP Tracking sync uop was issued
@ - SSE instruction followed an AVX256 instruction, dozens of cycles penalty is expected
! - instruction not supported, was not accounted in Analysis
| Num Of | Ports pressure in cycles | |
| Uops | 0 - DV | 1 | 2 - D | 3 - D | 4 | 5 | |
------------------------------------------------------------------------------
| 1 | 1.0 14.0 | | | | | | CP | vdivps xmm0, xmm0, xmm0
Total Num Of Uops: 1
===========================================================================
atischenko at ip-172-31-21-62:~/iaca-lin64/bin$ ./test-arith-fdiv.sh
***************************************SandyBridge***************************************
Intel(R) Architecture Code Analyzer Version - 2.1
Analyzed File - arith-fdiv.o
Binary Format - 64Bit
Architecture - SNB
Analysis Type - Throughput
Throughput Analysis Report
--------------------------
Block Throughput: 41.00 Cycles Throughput Bottleneck: InterIteration
Port Binding In Cycles Per Iteration:
-------------------------------------------------------------------------
| Port | 0 - DV | 1 | 2 - D | 3 - D | 4 | 5 |
-------------------------------------------------------------------------
| Cycles | 2.0 28.0 | 0.0 | 0.0 0.0 | 0.0 0.0 | 0.0 | 1.0 |
-------------------------------------------------------------------------
N - port number or number of cycles resource conflict caused delay, DV - Divider pipe (on port 0)
D - Data fetch pipe (on ports 2 and 3), CP - on a critical path
F - Macro Fusion with the previous instruction occurred
* - instruction micro-ops not bound to a port
^ - Micro Fusion happened
# - ESP Tracking sync uop was issued
@ - SSE instruction followed an AVX256 instruction, dozens of cycles penalty is expected
! - instruction not supported, was not accounted in Analysis
| Num Of | Ports pressure in cycles | |
| Uops | 0 - DV | 1 | 2 - D | 3 - D | 4 | 5 | |
------------------------------------------------------------------------------
| 3 | 2.0 28.0 | | | | | 1.0 | CP | vdivps ymm0, ymm0, ymm0
Total Num Of Uops: 3
If we use the DV value, then YMM is about 2 times as expensive as XMM (28 vs. 14 cycles). But I used the "Block Throughput" value above (41 vs. 14 cycles). As you can see, my block is exactly one instruction, which is why I decided to use "Block Throughput". Maybe I'm wrong? Can anyone suggest the right choice?
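To make the comparison concrete, here is a minimal sketch of the arithmetic behind the two candidate ratios. The cycle counts are copied from the two IACA reports above; the struct and helper names (IacaReport, ratio, printRatios) are hypothetical and exist only for this illustration. They are not part of the patch or of LLVM.

```cpp
#include <cassert>
#include <cstdio>

// Figures copied from the two IACA reports above (SNB, throughput analysis).
// The type and its field names are hypothetical, for illustration only.
struct IacaReport {
  const char *Inst;       // the single instruction in the analyzed block
  double BlockThroughput; // "Block Throughput" line, in cycles
  double DividerCycles;   // "DV" (divider pipe) column, in cycles
};

constexpr IacaReport XmmDiv = {"vdivps xmm0, xmm0, xmm0", 14.0, 14.0};
constexpr IacaReport YmmDiv = {"vdivps ymm0, ymm0, ymm0", 41.0, 28.0};

// Hypothetical helper: how many times larger To is than From.
constexpr double ratio(double From, double To) { return To / From; }

// Hypothetical helper: print the YMM/XMM ratio for both candidate metrics.
inline void printRatios(const IacaReport &A, const IacaReport &B) {
  assert(A.BlockThroughput > 0 && A.DividerCycles > 0);
  std::printf("%s -> %s\n", A.Inst, B.Inst);
  std::printf("  Block Throughput ratio: %.2f\n",
              ratio(A.BlockThroughput, B.BlockThroughput));
  std::printf("  DV ratio:               %.2f\n",
              ratio(A.DividerCycles, B.DividerCycles));
}
```

Calling printRatios(XmmDiv, YmmDiv) from a main function gives a ratio of about 2.93 for "Block Throughput" (41/14) and exactly 2 for DV (28/14), which is the gap the review comment is asking about.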
https://reviews.llvm.org/D25722