[PATCH] D25722: Improved cost model for FDIV and FSQRT

Tue Oct 25 01:12:29 PDT 2016

avt77 added inline comments.

================
Comment at: lib/Target/X86/X86TargetTransformInfo.cpp:269
+    { ISD::FDIV,  MVT::v4f32, 14 }, // IACA value for SandyBridge arch
+    { ISD::FDIV,  MVT::v8f32, 41 }, // IACA value for SandyBridge arch
+    { ISD::FDIV,  MVT::f64,   21 }, // IACA value for SandyBridge arch
----------------
mkuper wrote:
> A YMM fdiv being 3 times as expensive as a XMM fdiv seems slightly odd.
> 
> I'd expect "2, maybe a bit more", and Agner seems to agree - e.g. for Sandybridge he gives a range of 10-14 cycles for XMM DIVPS, and 20-28 cycles for YMM VDIVPS. Were your IACA checks accounting for additional instructions? Or is this an inconsistency between IACA and Anger's tests?
> 
> (Note that these numbers are supposed to represent reciprocal throughput, but Agner's data for latency also has factor of ~2)
> 
I use the following numbers:
atischenko at ip-172-31-21-62:~/iaca-lin64/bin$ ./test-arith-fdiv.sh
***************************************SandyBridge***************************************
Intel(R) Architecture Code Analyzer Version - 2.1
Analyzed File - arith-fdiv.o
Binary Format - 64Bit
Architecture  - SNB
Analysis Type - Throughput

Throughput Analysis Report
--------------------------
Block Throughput: 14.00 Cycles       Throughput Bottleneck: Divider

Port Binding In Cycles Per Iteration:
-------------------------------------------------------------------------
|  Port  |  0   -  DV  |  1   |  2   -  D   |  3   -  D   |  4   |  5   |
-------------------------------------------------------------------------
| Cycles | 1.0    14.0 | 0.0  | 0.0    0.0  | 0.0    0.0  | 0.0  | 0.0  |
-------------------------------------------------------------------------

N - port number or number of cycles resource conflict caused delay, DV - Divider pipe (on port 0)
D - Data fetch pipe (on ports 2 and 3), CP - on a critical path
F - Macro Fusion with the previous instruction occurred
* - instruction micro-ops not bound to a port
^ - Micro Fusion happened
# - ESP Tracking sync uop was issued
@ - SSE instruction followed an AVX256 instruction, dozens of cycles penalty is expected
! - instruction not supported, was not accounted in Analysis

| Num Of |                   Ports pressure in cycles                   |    |
|  Uops  |  0   -  DV  |  1   |  2   -  D   |  3   -  D   |  4   |  5   |    |
------------------------------------------------------------------------------
|   1    | 1.0    14.0 |      |             |             |      |      | CP | vdivps xmm0, xmm0, xmm0
Total Num Of Uops: 1

===========================================================================
atischenko at ip-172-31-21-62:~/iaca-lin64/bin$ 
./test-arith-fdiv.sh
***************************************SandyBridge***************************************
Intel(R) Architecture Code Analyzer Version - 2.1
Analyzed File - arith-fdiv.o
Binary Format - 64Bit
Architecture  - SNB
Analysis Type - Throughput

Throughput Analysis Report
--------------------------
Block Throughput: 41.00 Cycles       Throughput Bottleneck: InterIteration

Port Binding In Cycles Per Iteration:
-------------------------------------------------------------------------
|  Port  |  0   -  DV  |  1   |  2   -  D   |  3   -  D   |  4   |  5   |
-------------------------------------------------------------------------
| Cycles | 2.0    28.0 | 0.0  | 0.0    0.0  | 0.0    0.0  | 0.0  | 1.0  |
-------------------------------------------------------------------------

N - port number or number of cycles resource conflict caused delay, DV - Divider pipe (on port 0)
D - Data fetch pipe (on ports 2 and 3), CP - on a critical path
F - Macro Fusion with the previous instruction occurred
* - instruction micro-ops not bound to a port
^ - Micro Fusion happened
# - ESP Tracking sync uop was issued
@ - SSE instruction followed an AVX256 instruction, dozens of cycles penalty is expected
! - instruction not supported, was not accounted in Analysis

| Num Of |                   Ports pressure in cycles                   |    |
|  Uops  |  0   -  DV  |  1   |  2   -  D   |  3   -  D   |  4   |  5   |    |
------------------------------------------------------------------------------
|   3    | 2.0    28.0 |      |             |             |      | 1.0  | CP | vdivps ymm0, ymm0, ymm0
Total Num Of Uops: 3

If we use DV value then it's about 2 times. But I used "Block Throughput" value above. As you see my block is exactly one instruction that's why I decided to use "Block Throughput". Maybe I'm wrong? Can anyone suggest me the right decision?

https://reviews.llvm.org/D25722