[PATCH] D131118: [LV] Add generic scalarization support for unpredicated scalable vectors

Florian Hahn via Phabricator via llvm-commits llvm-commits at lists.llvm.org
Tue Aug 16 01:15:11 PDT 2022


fhahn added inline comments.


================
Comment at: llvm/test/Transforms/LoopVectorize/AArch64/sve-inv-store.ll:28
 ; CHECK-NEXT:    [[WIDE_LOAD:%.*]] = load <vscale x 4 x i16>, <vscale x 4 x i16>* [[TMP7]], align 2
-; CHECK-NEXT:    call void @llvm.masked.scatter.nxv4i16.nxv4p0i16(<vscale x 4 x i16> [[WIDE_LOAD]], <vscale x 4 x i16*> [[BROADCAST_SPLAT]], i32 2, <vscale x 4 x i1> shufflevector (<vscale x 4 x i1> insertelement (<vscale x 4 x i1> poison, i1 true, i32 0), <vscale x 4 x i1> poison, <vscale x 4 x i32> zeroinitializer))
-; CHECK-NEXT:    [[TMP8:%.*]] = call i64 @llvm.vscale.i64()
----------------
reames wrote:
> david-arm wrote:
> > Hi @reames, we definitely don't want to be doing this for SVE as it will likely hurt performance - scatters are still likely to be better. We made a conscious effort to avoid scalarising this way for SVE because we managed to find other ways of solving the same problems using existing instructions. Also, if we ever encountered a situation where we'd have to scalarise a scalable vector then the performance would likely be terrible, so we chose to return an Invalid cost from the cost model and skip that VF. We may as well just use NEON or not vectorise, because this is almost certainly better.
> > 
> > What use cases are you trying to solve here? This patch doesn't seem to fix any actual bugs, so I'm assuming this is for performance reasons. It looks like this change only affects one test (`@uniform_store_of_loop_varying`) and I guess the IR in this test is not a common idiom. Have you tested performance before and after to see if this is worthwhile?
> The motivation here is primarily robustness - both correctness wise and performance.
> 
> On the correctness front, the existing code pretty fundamentally assumes that scalarization is an available strategy.  The Invalid bailouts for scalable vectors have been added in, but I've already hit (and fixed) several cases where this was incomplete.  From a code-complexity and likelihood-of-bugs perspective, having fixed and scalable vectors share the same *capabilities* seems highly useful.
> 
> On the performance side, there's a two-level argument here.
> 
> For many cases, you're entirely right that there are other strategies available.  You'll note that I have a patch on review right now which implements a more optimal strategy for divides.  However, we have not implemented all of them.  As a result, the current situation is that these cases become *strictly* non-vectorizable.  This means that even loops that would otherwise be hugely profitable to vectorize are prevented from vectorization.  This is basically an "order of work" argument to minimize the pain while that work is happening.
> 
> However, there are also cases which we can vectorize, but fundamentally *don't* have a better strategy for.  One that we could vectorize - but don't today - is a call to a readonly function in an otherwise vectorizable loop.  We can't assume the existence of a vector variant of the call (since it's not a known function), but we could scalarize in general.  Now, put that call down a rarely taken predicated block and you start seeing why having the *capability* of scalarization is useful.  
> 
> Uniform memops are only a testing vehicle here.  Nothing more.  Please don't fixate on the generality of a uniform store test as that is not what this patch is really about.  
> 
> For aarch64, if scatters are profitable and legal, then the code should pick that already.  This is an example where the cost model believes scalarization is profitable.  It didn't make it into the final patch, but I played with some pretty serious cost penalties, and this case didn't change behavior.  If you have particular suggestions on how to adjust cost to penalize scalarizing "enough", I'm completely open to them.
In general, having this extra lowering strategy seems desirable with respect to guaranteeing that we will always be able to scalarize any construct rather than crash. This is an assumption made in various places.

It should be up to the cost model to avoid scalarizing in cases where it won't be profitable, but I can't comment on this particular test case or whether there are scenarios where this will be profitable in practice.


Repository:
  rG LLVM Github Monorepo

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D131118/new/

https://reviews.llvm.org/D131118


