[llvm] [AMDGPU] Vectorize i8 Shuffles (PR #105850)

Wed Oct 16 12:26:26 PDT 2024

================
@@ -306,6 +306,23 @@ bool GCNTTIImpl::hasBranchDivergence(const Function *F) const {
   return !F || !ST->isSingleLaneExecution(*F);
 }
 
+unsigned GCNTTIImpl::getNumberOfParts(Type *Tp) {
+  // For certain 8 bit ops, we can pack a v4i8 into a single part
+  // (e.g. v4i8 shufflevectors -> v_perm v4i8, v4i8). Thus, we
----------------
jrbyrnes wrote:

Ultimately, I still think the better approach is to have SLP handle this type of vectorization (as opposed to writing a data-flow vectorizer). It should be left to the cost model to accurately measure the quality and filter out all but the few ops we are interested in. That was the intent of this PR, but it seems I'll need to tune the costs for the calling convention at least.
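
To illustrate the packing idea behind the quoted comment (a v4i8 fitting in a single part), here is a rough standalone sketch. It is not the PR's implementation and does not use LLVM types: `packedParts` and `unpackedParts` are hypothetical helpers, and the 32-bit register width is an assumption made for the example.

```cpp
#include <cassert>

// Hypothetical helper (not from the PR): number of 32-bit registers
// ("parts") needed when sub-32-bit elements are packed together.
// The 32-bit register width is an assumption for this sketch.
unsigned packedParts(unsigned NumElts, unsigned EltBits) {
  const unsigned RegBits = 32;
  unsigned TotalBits = NumElts * EltBits;
  // Round up to a whole number of registers.
  return (TotalBits + RegBits - 1) / RegBits;
}

// Hypothetical helper (not from the PR): number of registers needed
// when every element occupies its own register(s), i.e. no packing.
unsigned unpackedParts(unsigned NumElts, unsigned EltBits) {
  const unsigned RegBits = 32;
  unsigned RegsPerElt = (EltBits + RegBits - 1) / RegBits;
  return NumElts * RegsPerElt;
}
```

Under these assumptions a v4i8 is one part when packed and four when each i8 takes a full register, which is the kind of difference the cost model would have to reflect for SLP to pick the packed form.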

https://github.com/llvm/llvm-project/pull/105850