[PATCH] D107790: [RISCV] Add a pass to recognize VLS strided loads/store from gather/scatter.

Thu Aug 19 01:41:25 PDT 2021

frasercrmck added inline comments.

================
Comment at: llvm/lib/Target/RISCV/RISCVGatherScatterLowering.cpp:238
+  // Make sure we have a splat.
+  Value *SplatOp = getSplatValue(OtherOp);
+  if (!SplatOp)
----------------
craig.topper wrote:
> rogfer01 wrote:
> > rogfer01 wrote:
> > > One interesting difference between fixed and scalable is that fixed-length vectorisation embeds an iota vector as a constant, used as the incoming value of a vector phi in the loop header.
> > > 
> > > Like this:
> > > 
> > > ```lang=llvm
> > > vector.body:
> > >   %index = phi i64 [ 0, %entry ], [ %index.next, %vector.body ]
> > >   %vec.ind = phi <32 x i64> [ <i64 0, i64 1, i64 2, i64 3, i64 4, i64 5, i64 6, i64 7, i64 8, i64 9, i64 10, i64 11, i64 12, i64 13, i64 14, i64 15, i64 16, i64 17, i64 18, i64 19, i64 20, i64 21, i64 22, i64 23, i64 24, i64 25, i64 26, i64 27, i64 28, i64 29, i64 30, i64 31>, %entry ], [ %vec.ind.next, %vector.body ]
> > >   %0 = mul nuw nsw <32 x i64> %vec.ind, <i64 5, i64 5, i64 5, i64 5, i64 5, i64 5, i64 5, i64 5, i64 5, i64 5, i64 5, i64 5, i64 5, i64 5, i64 5, i64 5, i64 5, i64 5, i64 5, i64 5, i64 5, i64 5, i64 5, i64 5, i64 5, i64 5, i64 5, i64 5, i64 5, i64 5, i64 5, i64 5>
> > >   %1 = getelementptr inbounds i8, i8* %B, <32 x i64> %0
> > > ```
> > > 
> > > However with scalable vectorisation (see https://www.godbolt.org/z/Gchx863os ) the vector phi is gone and the iota vector (stepvector) coming from the header is used to compute the vector of indices.
> > > 
> > > ```lang=llvm
> > > vector.ph:                                        ; preds = %entry
> > >   %4 = call <vscale x 2 x i64> @llvm.experimental.stepvector.nxv2i64(), !dbg !22
> > >   ...
> > >   br label %vector.body, !dbg !24
> > > vector.body:                                      ; preds = %vector.body, %vector.ph
> > >   %index = phi i64 [ 0, %vector.ph ], [ %index.next, %vector.body ], !dbg !25
> > >   %.splatinsert11 = insertelement <vscale x 2 x i64> poison, i64 %index, i32 0, !dbg !24
> > >   %.splat12 = shufflevector <vscale x 2 x i64> %.splatinsert11, <vscale x 2 x i64> poison, <vscale x 2 x i32> zeroinitializer, !dbg !24
> > >   %7 = add <vscale x 2 x i64> %.splat12, %4, !dbg !24
> > >   %8 = mul nuw nsw <vscale x 2 x i64> %7, shufflevector (<vscale x 2 x i64> insertelement (<vscale x 2 x i64> poison, i64 5, i32 0), <vscale x 2 x i64> poison, <vscale x 2 x i32> zeroinitializer), !dbg !27
> > >   %9 = getelementptr inbounds i8, i8* %B, <vscale x 2 x i64> %8, !dbg !27
> > > ```
> > > 
> > > So at this point the algorithm needs to diverge a bit, because the `phi` won't be the base case any more (there won't be a vector `phi`). Instead, as I understand it, we need to detect that we're splatting a scalar recurrence and combining it with a `stepvector`.
> > > 
> > > Not that we have to address it now. We may have to bear it in mind in the future if we plan to extend this to scalable vectors.
> > On second thought, the stepvector may get optimised in such a way that a vector phi is used (similar to the fixed case), so the difference goes away. Carrying the vector of indices through the loop seems better than synthesising it fully in every iteration.
> That's interesting that the form is different. Is that because the induction variable step would need to be a splat of vscale * fixed element count that would also need to be created?
Hmm, so @rogfer01 spurred me to actually look at what we're generating, and now I think I may have to bow out of the conversation.

Our vectorizer is doing something sufficiently different that we see a scalar PHI for both fixed and scalable vectorization. I'll post what we're doing in case it's useful to others, but if the following IR isn't something the in-tree vectorizer(s) will produce, I think I'll have to find some other solution that meets our needs.

Fixed:
``` lang=llvm
vector_body:                        ; preds = %vector_body, %vector_ph
  %lsr.iv4 = phi i32 addrspace(1)* [ %scevgep5, %vector_body ], [ %scevgep3, %vector_ph ]

  %19 = getelementptr i32, i32 addrspace(1)* %lsr.iv4, <4 x i64> <i64 0, i64 2, i64 4, i64 6>
  %20 = tail call <4 x i32> @llvm.masked.gather.v4i32.v4p1i32(<4 x i32 addrspace(1)*> %19, i32 immarg 4, <4 x i1> <i1 true, i1 true, i1 true, i1 true>, <4 x i32> undef) #5, !noalias !28
```
Scalable:
``` lang=llvm
vector_ph:
  %15 = tail call <vscale x 1 x i64> @llvm.experimental.stepvector.nxv1i64() #5
  %16 = shl <vscale x 1 x i64> %15, shufflevector (<vscale x 1 x i64> insertelement (<vscale x 1 x i64> undef, i64 1, i32 0), <vscale x 1 x i64> undef, <vscale x 1 x i32> zeroinitializer)
  ...

vector_body:                        ; preds = %vector_body, %vector_ph
  %lsr.iv5 = phi i32 addrspace(1)* [ %29, %vector_body ], [ %scevgep4, %vector_ph ]

  %24 = getelementptr i32, i32 addrspace(1)* %lsr.iv5, <vscale x 1 x i64> %16
  %25 = tail call <vscale x 1 x i32> @llvm.masked.gather.nxv1i32.nxv1p1i32(<vscale x 1 x i32 addrspace(1)*> %24, i32 immarg 4, <vscale x 1 x i1> shufflevector (<vscale x 1 x i1> insertelement (<vscale x 1 x i1> poison, i1 true, i32 0), <vscale x 1 x i1> poison, <vscale x 1 x i32> zeroinitializer), <vscale x 1 x i32> undef) #6, !noalias !28
```

So I reckon this form is definitely easier to pattern match into strided accesses, and it is uniform across the two vector types, but that's by the by.
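
To illustrate why the fixed form above is easy to match, here is a minimal, self-contained C++ sketch. It is not code from the patch, and `StridedAccess` and `matchConstantStride` are hypothetical names. It checks whether a constant index vector is an arithmetic sequence and, if so, scales it by the element size to get a byte stride. It assumes the indices fit in `int64_t` and ignores overflow.

```cpp
#include <cstddef>
#include <cstdint>
#include <optional>
#include <vector>

// Hypothetical result type (not from the patch): a strided access
// described relative to a scalar base pointer, in bytes.
struct StridedAccess {
  int64_t StartOffsetBytes; // offset of lane 0 from the base pointer
  int64_t StrideBytes;      // distance between consecutive lanes
};

// Hypothetical helper (not from the patch): returns the start offset and
// stride if Indices forms an arithmetic sequence, std::nullopt otherwise.
// Indices are GEP element indices; ElementSizeBytes is the size of the
// GEP element type (4 for i32).
std::optional<StridedAccess>
matchConstantStride(const std::vector<int64_t> &Indices,
                    int64_t ElementSizeBytes) {
  // With fewer than two lanes there is no step to measure.
  if (Indices.size() < 2)
    return std::nullopt;

  const int64_t Step = Indices[1] - Indices[0];
  for (std::size_t I = 2; I < Indices.size(); ++I)
    if (Indices[I] - Indices[I - 1] != Step)
      return std::nullopt; // not a constant step, so not strided

  return StridedAccess{Indices[0] * ElementSizeBytes,
                       Step * ElementSizeBytes};
}
```

For the fixed example, `matchConstantStride({0, 2, 4, 6}, 4)` gives a start offset of 0 and a stride of 8 bytes. The scalable case has no constant vector to inspect; there the same information would have to come from recognising the `stepvector` shifted left by 1, which this sketch does not attempt.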

Repository:
  rG LLVM Github Monorepo

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D107790/new/

https://reviews.llvm.org/D107790