[PATCH] D32827: [AArch64] Correct lane zero optimization in insert/extract costs

Wed May 3 15:02:03 PDT 2017

mssimpso added a comment.

Hi Adam,

> Actually, this is also true if the insert is fed by a load.  In this case we can just directly load into the vector register.  In my recent experience with SLP this seemed like a pretty important case which is changing with this patch.  How does performance look?

That's right. In fact, that should be true for all load/insert sequences of legal types, not just ones that insert into lane zero: we should generate LD1s for all lanes. So I think that could be an additional optimization, probably in a separate patch? Regarding performance, this patch is generally beneficial for our cores (Kryo, Falkor), but our base insert/extract cost (2) is already lower than the default (3), so the effect on other cores may be somewhat different.

> What does this mean in terms of the change of cost?  3->1?  Inserts are I think pretty expensive even within-class because they represent partial-writes.  I was surprised to discover recently that extracts and inserts have the same costs.

The changes to the default costs can be seen in the new cost model test I added. Basically, for integer and pointer types, things stay the same at 3 (but now lane zero is also 3). For floating-point types, lane zero stays the same at 0 (but now the other lanes are at 1). But I'm happy to leave the floating-point non-zero lanes at 3 if you prefer, and to consider them instead in a follow-on patch if necessary.
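
To make the default costs described above concrete, here is a rough standalone sketch. It is not the code from the patch: `ElemKind`, `DefaultBaseCost`, and `proposedDefaultCost` are hypothetical names used only for illustration, and the sketch covers only the default base cost of 3, not subtarget-specific base costs such as the 2 mentioned for Kryo/Falkor.

```cpp
#include <cassert>

// Hypothetical names for illustration only; none of these are LLVM APIs.
enum class ElemKind { Integer, Pointer, FloatingPoint };

// Default insert/extract base cost mentioned in this thread.
constexpr int DefaultBaseCost = 3;

// Default per-lane insert/extract cost as described in the comment above:
//  - integer and pointer elements cost the base cost in every lane,
//    including lane zero;
//  - floating-point elements cost 0 in lane zero and 1 in other lanes.
constexpr int proposedDefaultCost(ElemKind Kind, unsigned Lane) {
  return Kind == ElemKind::FloatingPoint ? (Lane == 0 ? 0 : 1)
                                         : DefaultBaseCost;
}

// Spot-check the values quoted in the comment.
inline void checkDescribedCosts() {
  assert(proposedDefaultCost(ElemKind::Integer, 0) == 3);
  assert(proposedDefaultCost(ElemKind::Pointer, 2) == 3);
  assert(proposedDefaultCost(ElemKind::FloatingPoint, 0) == 0);
  assert(proposedDefaultCost(ElemKind::FloatingPoint, 1) == 1);
}
```

Under the more conservative alternative mentioned above, the floating-point non-zero lanes would return 3 instead of 1.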

My priority here is fixing the "extracting from lane zero costs nothing" assumption, which is not true for integer types. If this change is too scary all at once, we could start with that case and tackle the other issues/optimizations one by one. What do you think?

https://reviews.llvm.org/D32827