[PATCH] D25722: Improved cost model for FDIV and FSQRT

Andrew V. Tischenko via llvm-commits llvm-commits at lists.llvm.org
Thu Oct 27 06:35:17 PDT 2016


avt77 added inline comments.


================
Comment at: lib/Target/X86/X86TargetTransformInfo.cpp:269
+    { ISD::FDIV,  MVT::v4f32, 14 }, // IACA value for SandyBridge arch
+    { ISD::FDIV,  MVT::v8f32, 41 }, // IACA value for SandyBridge arch
+    { ISD::FDIV,  MVT::f64,   21 }, // IACA value for SandyBridge arch
----------------
avt77 wrote:
> mkuper wrote:
> > A YMM fdiv being 3 times as expensive as a XMM fdiv seems slightly odd.
> > 
> > I'd expect "2, maybe a bit more", and Agner seems to agree - e.g. for Sandybridge he gives a range of 10-14 cycles for XMM DIVPS, and 20-28 cycles for YMM VDIVPS. Were your IACA checks accounting for additional instructions? Or is this an inconsistency between IACA and Agner's tests?
> > 
> > (Note that these numbers are supposed to represent reciprocal throughput, but Agner's data for latency also has factor of ~2)
> > 
> I use the following numbers:
> atischenko at ip-172-31-21-62:~/iaca-lin64/bin$ ./test-arith-fdiv.sh
> ***************************************SandyBridge***************************************
> Intel(R) Architecture Code Analyzer Version - 2.1
> Analyzed File - arith-fdiv.o
> Binary Format - 64Bit
> Architecture  - SNB
> Analysis Type - Throughput
> 
> Throughput Analysis Report
> --------------------------
> Block Throughput: 14.00 Cycles       Throughput Bottleneck: Divider
> 
> Port Binding In Cycles Per Iteration:
> -------------------------------------------------------------------------
> |  Port  |  0   -  DV  |  1   |  2   -  D   |  3   -  D   |  4   |  5   |
> -------------------------------------------------------------------------
> | Cycles | 1.0    14.0 | 0.0  | 0.0    0.0  | 0.0    0.0  | 0.0  | 0.0  |
> -------------------------------------------------------------------------
> 
> N - port number or number of cycles resource conflict caused delay, DV - Divider pipe (on port 0)
> D - Data fetch pipe (on ports 2 and 3), CP - on a critical path
> F - Macro Fusion with the previous instruction occurred
> * - instruction micro-ops not bound to a port
> ^ - Micro Fusion happened
> # - ESP Tracking sync uop was issued
> @ - SSE instruction followed an AVX256 instruction, dozens of cycles penalty is expected
> ! - instruction not supported, was not accounted in Analysis
> 
> | Num Of |                   Ports pressure in cycles                   |    |
> |  Uops  |  0   -  DV  |  1   |  2   -  D   |  3   -  D   |  4   |  5   |    |
> ------------------------------------------------------------------------------
> |   1    | 1.0    14.0 |      |             |             |      |      | CP | vdivps xmm0, xmm0, xmm0
> Total Num Of Uops: 1
> 
> ===========================================================================
> atischenko at ip-172-31-21-62:~/iaca-lin64/bin$ 
> ./test-arith-fdiv.sh
> ***************************************SandyBridge***************************************
> Intel(R) Architecture Code Analyzer Version - 2.1
> Analyzed File - arith-fdiv.o
> Binary Format - 64Bit
> Architecture  - SNB
> Analysis Type - Throughput
> 
> Throughput Analysis Report
> --------------------------
> Block Throughput: 41.00 Cycles       Throughput Bottleneck: InterIteration
> 
> Port Binding In Cycles Per Iteration:
> -------------------------------------------------------------------------
> |  Port  |  0   -  DV  |  1   |  2   -  D   |  3   -  D   |  4   |  5   |
> -------------------------------------------------------------------------
> | Cycles | 2.0    28.0 | 0.0  | 0.0    0.0  | 0.0    0.0  | 0.0  | 1.0  |
> -------------------------------------------------------------------------
> 
> N - port number or number of cycles resource conflict caused delay, DV - Divider pipe (on port 0)
> D - Data fetch pipe (on ports 2 and 3), CP - on a critical path
> F - Macro Fusion with the previous instruction occurred
> * - instruction micro-ops not bound to a port
> ^ - Micro Fusion happened
> # - ESP Tracking sync uop was issued
> @ - SSE instruction followed an AVX256 instruction, dozens of cycles penalty is expected
> ! - instruction not supported, was not accounted in Analysis
> 
> | Num Of |                   Ports pressure in cycles                   |    |
> |  Uops  |  0   -  DV  |  1   |  2   -  D   |  3   -  D   |  4   |  5   |    |
> ------------------------------------------------------------------------------
> |   3    | 2.0    28.0 |      |             |             |      | 1.0  | CP | vdivps ymm0, ymm0, ymm0
> Total Num Of Uops: 3
> 
> If we use the DV value then the ratio is about 2x, but I used the "Block Throughput" value above. As you can see, my block is exactly one instruction, which is why I decided to use "Block Throughput". Maybe I'm wrong? Can anyone suggest the right choice?
Maybe I've understood the difference between Agner's and IACA's numbers (for SandyBridge at least): Agner's table has numbers for ymm only:
VDIVPS y,y,y 3 3 2 1 21-29 20-28 AVX
VDIVPS y,y,m256 4 3 2 1 1+ 20-28 AVX

but when I played with IACA, it showed different operand types: xmm0, ... and ymm0, ....

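For reference, the figure under discussion can be pulled out of an IACA report mechanically. Below is a minimal sketch (a hypothetical helper, not part of the patch) that extracts the "Block Throughput" value from a report line like the ones quoted above:

```cpp
#include <cassert>
#include <cstdio>
#include <string>

// Hypothetical helper, for illustration only: extract the block-throughput
// figure from an IACA report line such as
//   "Block Throughput: 41.00 Cycles       Throughput Bottleneck: InterIteration"
// Returns -1.0 when the line does not carry that figure.
double parseBlockThroughput(const std::string &Line) {
  double Cycles = -1.0;
  std::sscanf(Line.c_str(), "Block Throughput: %lf Cycles", &Cycles);
  return Cycles;
}
```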

================
Comment at: lib/Target/X86/X86TargetTransformInfo.cpp:269
+    { ISD::FDIV,  MVT::v4f32, 14 }, // IACA value for SandyBridge arch
+    { ISD::FDIV,  MVT::v8f32, 41 }, // IACA value for SandyBridge arch
+    { ISD::FDIV,  MVT::f64,   21 }, // IACA value for SandyBridge arch
----------------
And another comment: for f32/v4f32, IACA shows 14 for vdivss/vdivps, but Agner's table shows the same 10-14 for divss/divps and 20-28 for vdivss/vdivps. Clang generates vdivss/vdivps (for the SNB target), which is why I should select 28 as the cost value if we're using Agner's numbers. Is that OK?


================
Comment at: lib/Target/X86/X86TargetTransformInfo.cpp:269
+    { ISD::FDIV,  MVT::v4f32, 14 }, // IACA value for SandyBridge arch
+    { ISD::FDIV,  MVT::v8f32, 41 }, // IACA value for SandyBridge arch
+    { ISD::FDIV,  MVT::f64,   21 }, // IACA value for SandyBridge arch
----------------
As an intermediate decision, I did the following for the SNB numbers: xmm operands use the first number as the cost, while ymm operands use the second number. For example:

    { ISD::FDIV, MVT::f32,   20 }, // SNB from http://www.agner.org/ (IACA: 14)
    { ISD::FDIV, MVT::v4f32, 20 }, // SNB from http://www.agner.org/ (IACA: 14)
    { ISD::FDIV, MVT::v8f32, 28 }, // SNB from http://www.agner.org/ (IACA: 41)

Is that acceptable? In fact, we need some extension of our current cost tables here: it is not enough to know the instruction itself and the type of its operands; we need some target-dependent info as well. E.g., if we're speaking about X86 targets only (exactly our case), we could add the size of the target registers or something similar, such as the ISA: SNB/HSW can use both divps and vdivps, and using two divps could be better than using one vdivps, but our current cost model cannot help decide which is better.
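To make the shape of that limitation concrete, here is a simplified stand-in for the cost-table lookup (a local mock mimicking LLVM's CostTblEntry/CostTableLookup, using the proposed SNB values above, not the real headers): the key is only (opcode, value type), so there is no slot for ISA or register-width information.

```cpp
#include <cassert>

// Local mocks, for illustration only; names mimic the real LLVM ones.
enum Opcode { FDIV, FSQRT };
enum SimpleVT { f32, v4f32, v8f32 };

struct CostTblEntry {
  Opcode ISD;
  SimpleVT Type;
  unsigned Cost;
};

// Proposed SNB costs from the comment above.
static const CostTblEntry SNBCostTbl[] = {
  { FDIV, f32,   20 }, // SNB from http://www.agner.org/ (IACA: 14)
  { FDIV, v4f32, 20 }, // SNB from http://www.agner.org/ (IACA: 14)
  { FDIV, v8f32, 28 }, // SNB from http://www.agner.org/ (IACA: 41)
};

// Linear scan, as the real lookup does over these small tables. The key is
// (opcode, type) only -- there is nowhere to express "two divps may beat
// one vdivps" for a given subtarget. Returns -1 when no entry matches.
int lookupCost(Opcode Op, SimpleVT VT) {
  for (const CostTblEntry &E : SNBCostTbl)
    if (E.ISD == Op && E.Type == VT)
      return static_cast<int>(E.Cost);
  return -1; // caller falls back to a default cost
}
```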


================
Comment at: lib/Target/X86/X86TargetTransformInfo.cpp:269
+    { ISD::FDIV,  MVT::v4f32, 14 }, // IACA value for SandyBridge arch
+    { ISD::FDIV,  MVT::v8f32, 41 }, // IACA value for SandyBridge arch
+    { ISD::FDIV,  MVT::f64,   21 }, // IACA value for SandyBridge arch
----------------
BTW, I've just realized that the Haswell IACA numbers are really close to the expected ones:

f32    13 vdivss
v4f32  13 vdivps xmm0...
v8f32  26 vdivps ymm0...
f64    20 vdivsd
v2f64  20 vdivpd
v4f64  47 vdivpd ymm0...

It means we have problems with the IACA numbers for SNB only (too big a difference between xmm and ymm). Maybe newer CPUs simply resolved this issue?
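The "2, maybe a bit more" expectation can be written down as a quick sanity check. A sketch (the 1.8-2.5 tolerance window is my own illustrative choice, not from the review) applied to the numbers above:

```cpp
#include <cassert>

// Is the ymm cost within "2, maybe a bit more" of the xmm cost?
// The 1.8..2.5 window is an arbitrary illustrative tolerance.
bool roughlyDouble(unsigned XmmCost, unsigned YmmCost) {
  double Ratio = static_cast<double>(YmmCost) / XmmCost;
  return Ratio >= 1.8 && Ratio <= 2.5;
}
```

Under that check the Haswell pairs (13/26, 20/47) pass, while the suspicious SNB IACA pair (14/41, a ~2.9x ratio) fails.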


================
Comment at: lib/Target/X86/X86TargetTransformInfo.cpp:1088
   };
   static const CostTblEntry SSSE3CostTbl[] = {
     { ISD::BITREVERSE, MVT::v2i64,   5 },
----------------
RKSimon wrote:
> RKSimon wrote:
> > Worth adding a SSE41CostTbl for Core2 era costs?
> Please add Nehalem costs (from Agner) - they're notably better than the P4 default:
> 
> FSQRT f32/4f32 : 18 f64/2f64 : 32
JFYI, I got the same numbers for Nehalem with IACA.


https://reviews.llvm.org/D25722





More information about the llvm-commits mailing list