[PATCH] D32827: [AArch64] Correct lane zero optimization in insert/extract costs

Mon May 8 14:32:40 PDT 2017

anemet added a comment.

In https://reviews.llvm.org/D32827#745349, @mssimpso wrote:

> Hi Adam,
>
> > Actually, this is also true if the insert is fed by a load.  In this case we can just directly load into the vector register.  In my recent experience with SLP this seemed like a pretty important case which is changing with this patch.  How does performance look?
>
> That's right. That actually should be true for all load/insert sequences of legal types, not just ones that insert into lane zero - we should generate LD1s for all lanes. So I think that could be an additional optimization, probably in a separate patch?

Sounds good.

> Regarding performance, this is generally beneficial for our cores (Kryo, Falkor), but our base insert/extract cost (2) is already lower than the default (3), so the effect may be somewhat different.

For perf changes like this, it would be great to have a more detailed analysis of the changed hotspots (like what I did for the 64-bit SLP).  Opt-viewer is a great tool for this, especially with opt-diff, which will tell you the changes in SLP vectorization in the hotspots (if you run it with PGO).  Unfortunately, I haven't had time to clean up my patches that add SLP vectorization remarks, so you can't use it yet :(.

>> What does this mean in terms of the change of cost?  3->1?  Inserts are I think pretty expensive even within-class because they represent partial-writes.  I was surprised to discover recently that extracts and inserts have the same costs.
> 
> The changes to the default costs can be seen in the new cost model test I added. Basically, for integer and pointer types, things stay the same at 3 (but now lane zero is also 3). For floating-point, lane zero stays the same at 0 (but now the other lanes are at 1). But I'm happy leaving the floating-point non-zero lanes at 3 if you prefer, and considering them instead in a follow-on patch if necessary.

These make sense to me.
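
To make sure we are reading the numbers the same way, here is a rough sketch of the default costs as quoted above. It is only an illustration, not the code in the patch: `ElemKind` and `sketchVectorInstrCost` are made-up names, and the base cost of 3 is the default mentioned in this thread.

```cpp
#include <cassert>

// Hypothetical element classification for this sketch; not an LLVM type.
enum class ElemKind { IntegerOrPointer, FloatingPoint };

// Illustrative only: reproduces the default insert/extract costs quoted
// above. This is not the AArch64 cost model implementation.
unsigned sketchVectorInstrCost(ElemKind Kind, unsigned Lane,
                               unsigned BaseCost = 3) {
  if (Kind == ElemKind::FloatingPoint)
    return Lane == 0 ? 0 : 1; // lane zero stays free; other lanes cost 1
  return BaseCost;            // integer/pointer: every lane, lane zero included
}
```

With these assumptions, an integer extract from lane zero costs 3 rather than 0, which is the part of the change that matters most here.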

I am wondering if a better abstraction would be to have a cost for cross-domain moves rather than using these magic values.  I guess it doesn't matter if this is the only place where we have this logic.
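
Roughly what I have in mind, as a sketch under my own assumptions: `Domain`, `MoveCosts`, and `sketchLaneAccessCost` are hypothetical names that do not exist in LLVM, and the values 3 and 1 are just the defaults discussed in this thread.

```cpp
#include <cassert>

// Hypothetical register domains for this sketch; not LLVM identifiers.
enum class Domain { GeneralPurpose, FpSimd };

// Hypothetical named costs that would replace the literal values.
struct MoveCosts {
  unsigned CrossDomain = 3;  // scalar lives outside the vector register file
  unsigned WithinDomain = 1; // lane move that stays inside it
};

// Cost of inserting or extracting one lane, given the domain the scalar
// value lives in. Illustrative only.
unsigned sketchLaneAccessCost(Domain ScalarDomain, unsigned Lane,
                              MoveCosts Costs = MoveCosts()) {
  if (ScalarDomain != Domain::FpSimd)
    return Costs.CrossDomain;                // integer/pointer scalars
  return Lane == 0 ? 0 : Costs.WithinDomain; // FP lane zero needs no move
}
```

A subtarget with a cheaper base cost would then only override `CrossDomain` instead of repeating the lane logic.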

> My priority here is fixing the "extracting from lane zero costs nothing" issue, which is not true for integer types. If this change is too scary all at once, we could start with that case and tackle the issues/optimizations one-by-one. What do you think?

Yes, we should definitely commit this in two pieces.

https://reviews.llvm.org/D32827