[PATCH] D21251: [TTI] The cost model should not assume illegal vector casts get completely scalarized

Fri Jun 17 16:26:47 PDT 2016

mkuper added a comment.

Could someone on the X86 side verify that this is sane-ish?
Some costs dropped by a very large factor, and the new costs look better, but I'd appreciate another set of eyes.


================
Comment at: test/Analysis/CostModel/AMDGPU/addrspacecast.ll:39
@@ -38,3 +38,3 @@
 ; CHECK: 'addrspacecast_local_to_flat_v32'
-; CHECK: estimated cost of 32 for {{.*}} addrspacecast <32 x i8 addrspace(3)*> %ptr to <32 x i8 addrspace(4)*>
+; CHECK: estimated cost of 47 for {{.*}} addrspacecast <32 x i8 addrspace(3)*> %ptr to <32 x i8 addrspace(4)*>
 define <32 x i8 addrspace(4)*> @addrspacecast_local_to_flat_v32(<32 x i8 addrspace(3)*> %ptr) #0 {
----------------
arsenm wrote:
> arsenm wrote:
> > mkuper wrote:
> > > arsenm wrote:
> > > > Pretty much everything should be scalarized. The vector inserts and extracts are supposed to be free (and the cost is reported as 0 for those), so I think adding the 1 there is inconsistent; it should check the extract/insert cost instead.
> > > Until now, we assumed scalarization, but I think this is actually the rare case in practice. If the platform cares about vectors, I'd expect it to support most vector operations at least at some vector width, so it usually won't scalarize. And if we assume partial splitting instead of scalarization, using the insert/extract costs will be the wrong thing, regardless of how imprecise "1" is for splitting (and it's definitely imprecise, but it's what the generic getTypeLegalizationCost() uses).
> > > 
> > > We could trace the entire legalization chain, and see whether the end result is a vector or a scalar, and then use either 1 or the getScalarizationOverhead() based on that, but I'm not a huge fan of that.
> > > (Is full scalarization common on AMDGPU, or is this a corner case? If it's common, perhaps we should specialize this for the AMDGPU TTI.)
> > There are no vector operations. Vectors are only for loading and storing; every operation is scalar.
> There's also no additional scalarization cost, since accessing a vector element is just accessing a subregister.
Ah, I see.

I could add a TTI hook for "split cost", and set that to 0 for AMDGPU. But I'm not entirely happy with that either. How does that sound to you?
Any other suggestions that will make this work correctly for both AMDGPU and platforms that legalize vectors partially?
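
To make the trade-off concrete, here is a toy C++ sketch of the two estimates being discussed: full scalarization versus recursive splitting with a per-split cost. Everything in it (ToyTarget, toyScalarizedCost, toySplitLegalizedCost, and the field names) is hypothetical; it is not the TTI API and not the code in this patch. It also assumes power-of-two element counts and a unit cost for every legal operation.

```cpp
// Toy cost model, NOT LLVM code. All names here are made up for illustration.
struct ToyTarget {
  unsigned WidestLegalElts;   // widest legal vector, in elements; 1 = scalar ops only
  unsigned SplitCost;         // assumed cost of splitting one vector into two halves
  unsigned InsertExtractCost; // assumed cost of one element insert or extract
};

// "Old" assumption: an illegal vector cast is fully scalarized.
// One scalar op per element, plus one extract and one insert per element.
unsigned toyScalarizedCost(const ToyTarget &T, unsigned NumElts) {
  return NumElts + 2 * NumElts * T.InsertExtractCost;
}

// "New" assumption: an illegal vector cast is split in half until it is legal.
// NumElts is assumed to be a power of two.
unsigned toySplitLegalizedCost(const ToyTarget &T, unsigned NumElts) {
  if (NumElts <= T.WidestLegalElts)
    return 1; // legal at this width: a single operation
  if (NumElts == 2)
    return toyScalarizedCost(T, 2); // last step down to scalars
  // Pay for the split, then for both halves.
  return T.SplitCost + 2 * toySplitLegalizedCost(T, NumElts / 2);
}
```

Under these assumptions, a scalar-only target with free insert/extract gets 47 for 32 elements when each split costs 1, and 32 when the split cost is 0, which is the kind of gap a zero "split cost" hook would close. A target whose widest legal vector has 8 elements gets 7 from splitting, against 96 from full scalarization with an insert/extract cost of 1.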



http://reviews.llvm.org/D21251