[PATCH] D32827: [AArch64] Correct lane zero optimization in insert/extract costs

Wed May 3 13:39:21 PDT 2017

anemet added a comment.

Hi Matt,

> In the TTI calculation of vector insert and extract costs, we have an optimization that returns a cost of zero if we are inserting into or extracting from vector lane zero. All other inserts and extracts cost the base amount specified by the sub-target. However, the lane zero optimization only makes sense for floating-point types (i.e., within-class moves). For integer types, we should incur a cost for moving data from vector to general purpose registers, even for lane zero.

Actually, the zero cost also holds when the insert is fed by a load, since in that case we can load directly into the vector register. In my recent experience with SLP this seemed like a pretty important case, and its cost changes with this patch. How does performance look?

> This patch modifies the lane zero optimization so that it applies only to floating-point types. Additionally, we now fall back to the base TTI implementation for all other floating-point inserts and extracts. The existing sub-target specified insert/extract costs are used only for the cross-class moves, which I think was probably the original intent. Since the existing code looks like a bug to me, I checked the X86 target, and it implements something similar to what is in this patch.
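If I'm reading the description correctly, the new logic amounts to something like the sketch below. This is only my paraphrase: `ElemKind`, `CostModel`, and `vectorInsertExtractCost` are hypothetical names for illustration, not the actual TTI interface, and the two cost fields stand in for whatever the base implementation and the sub-target return.

```cpp
// Hypothetical stand-ins for illustration; not LLVM's real types or API.
enum class ElemKind { Integer, FloatingPoint };

struct CostModel {
  int baseCost;        // assumed cost from the base TTI implementation
  int crossClassCost;  // assumed sub-target cost for vector <-> GPR moves
};

// Sketch of the insert/extract cost as described in the patch summary.
int vectorInsertExtractCost(const CostModel &M, ElemKind Kind, unsigned Lane) {
  if (Kind == ElemKind::FloatingPoint) {
    // Within-class move: lane zero is free.
    if (Lane == 0)
      return 0;
    // Other floating-point lanes fall back to the base cost.
    return M.baseCost;
  }
  // Integer elements cross register classes, so they pay the
  // sub-target cost even for lane zero.
  return M.crossClassCost;
}
```

Note that nothing in this sketch looks at whether the inserted value comes from a load.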

What does this mean in terms of the actual change in cost? 3 -> 1? Inserts are, I think, pretty expensive even within-class because they are partial writes. I was surprised to discover recently that extracts and inserts have the same cost.

Thanks,
Adam

https://reviews.llvm.org/D32827