[PATCH] D42981: [COST] Fix cost model of load instructions on X86

Thu Feb 15 09:50:35 PST 2018

spatel added a comment.

In https://reviews.llvm.org/D42981#1001852, @ABataev wrote:

> In https://reviews.llvm.org/D42981#1001119, @spatel wrote:
>
> > In https://reviews.llvm.org/D42981#1001109, @ABataev wrote:
> >
> > > In https://reviews.llvm.org/D42981#1001063, @spatel wrote:
> > >
> > > > The patch is doing what I hoped for on PR36280, but I don't understand the reasoning.
> > > >
> > > > A folded load is not actually free on any x86 CPU AFAIK. It always has an extra uop, and that uop is handled on a different port / execution unit than the math op. The cost model that we're using here is based on throughput rates (let me know if I'm wrong), so I don't see how any load could be free.
> > >
> > >
> > > This cost is already part of the cost of the arithmetic instructions, and we count it a second time when we estimate the cost of standalone load instructions.
> >
> >
> > I need some help to understand this. Is SLP double-counting the cost of load instructions? If so, why? If you could explain exactly what is happening in the PR36280 test, that would be good for me.
>
>
> Sure. In PR36280 we have a vectorization tree of 3 nodes: 1) float mul insts %mul1, %mul2; 2) load insts %p1, %p2; 3) Gather %x, %y. We also have 2 external uses, %add1 and %add2. The vectorization cost is calculated on a per-tree-node basis. 1) Node cost is 2 (cost of the vectorized fmul) - (2 + 2) (costs of the scalar muls) = -2; 2) Node cost is 1 (cost of the vectorized load) - (1 + 1)(!!!!) (cost of the scalar loads) = -1. Note that in the resulting code these loads are folded into the vmulss instructions, and the cost of those instructions was already accounted for when we calculated the cost of vectorizing the floating-point multiplications. The real cost must be 1 (vector load) - (0 + 0) (the cost of the scalar loads) = 1; 3) The cost of the gather is 1 for the gather + 1 for an extract op. The total is thus -1. If we correct the cost of the loads, the final cost is going to be 1.
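
The per-node arithmetic quoted above can be checked with a short sketch. This is only an illustration of the numbers given in the comment, not the SLP vectorizer's actual code: the function names `node_cost` and `tree_cost` are hypothetical, and the costs are the ones stated in the comment rather than values queried from the X86 cost model.

```python
# Illustrative sketch only: the helper names are hypothetical and the
# numbers are copied from the review comment, not read from LLVM.

def node_cost(vector_cost, scalar_costs):
    """Cost delta of one tree node: vectorized cost minus the scalar costs it replaces."""
    return vector_cost - sum(scalar_costs)

def tree_cost(scalar_load_cost):
    """Sum the three node costs described for the PR36280 test."""
    fmul = node_cost(2, [2, 2])                                 # 2 - (2 + 2) = -2
    load = node_cost(1, [scalar_load_cost, scalar_load_cost])   # 1 - 2 * scalar load cost
    gather = 1 + 1                                              # gather + one extract op
    return fmul + load + gather

# Before the patch each scalar load is costed at 1; with the patch the
# folded scalar loads are treated as already paid for (cost 0).
print("scalar load cost 1:", tree_cost(1))
print("scalar load cost 0:", tree_cost(0))
```

With a scalar load cost of 1 the sum matches the total of -1 from the comment, and with a scalar load cost of 0 it matches the corrected total of 1.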

Thank you for the explanation. I thought we were summing uops as the cost calculation, but we're not.

I'm not sure if the current model is best suited for x86 (I'd use uop count as the first approximation for x86 perf), but I guess that's a bigger and independent question. I still don't know SLP that well, so if others are happy with this solution, I have no objections.

Repository:
  rL LLVM

https://reviews.llvm.org/D42981