[PATCH] D98714: [SLP] Add insertelement instructions to vectorizable tree

Anton Afanasyev via Phabricator via llvm-commits llvm-commits at lists.llvm.org
Fri Apr 9 10:07:43 PDT 2021


anton-afanasyev marked 4 inline comments as done.
anton-afanasyev added inline comments.


================
Comment at: llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp:2841-2844
+        if (!llvm::any_of(VL, [&](Value *S) { return Op == S; })) {
+          assert(!FirstInsert && "FirstInsert already found?");
+          FirstInsert = IE;
+        }
----------------
ABataev wrote:
> anton-afanasyev wrote:
> > ABataev wrote:
> > > anton-afanasyev wrote:
> > > > ABataev wrote:
> > > > > The first insert has undef vector operand, no?
> > > > > Plus, this has n^2 complexity
> > > > Not necessary. There could be the case we deal with splitted array of inserts:
> > > > ```
> > > >   %rv0  = insertelement <8 x i32> undef, i16 %r0 , i32 0
> > > >   %rv1  = insertelement <8 x i32> %rv0 , i16 %r1 , i32 1
> > > >   %rv2  = insertelement <8 x i32> %rv1 , i16 %r2 , i32 2
> > > >   %rv3  = insertelement <8 x i32> %rv2 , i16 %r3 , i32 3
> > > >   %rv4  = insertelement <8 x i32> %rv3 , i16 %r4 , i32 4
> > > >   %rv5  = insertelement <8 x i32> %rv4 , i16 %r5 , i32 5
> > > >   %rv6  = insertelement <8 x i32> %rv5 , i16 %r6 , i32 6
> > > >   %rv7  = insertelement <8 x i32> %rv6 , i16 %r7 , i32 7
> > > > ```
> > > > Since `<8 x i32>` doesn't fit SSE register, it is splitted to two `<4 x i32>`. The first time we process `%rv0-%rv3`, the next time -- `%rv4-%rv7`.
> > > Hmm, looks like currently it can be only the first element.
> > Why? We process long insert chains slice by slice:
> > https://github.com/llvm/llvm-project/blob/main/llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp#L6408
> Yes, I mean the first element in the slice, i.e. `VL[0]` since we do not allow reordering of insertelement instructions, no?
The inserts passed to `tryToVectorizeList()` are sorted by index (by `findBuildAggregate()`), so the first element defined by def-use order could mismatch to `VL[0]`. For instance:
```
%rv0  = insertelement <8 x i32> undef, i16 %r0 , i32 3
%rv1  = insertelement <8 x i32> %rv0 , i16 %r1 , i32 2
%rv2  = insertelement <8 x i32> %rv1 , i16 %r2 , i32 0
%rv3  = insertelement <8 x i32> %rv2 , i16 %r3 , i32 1
%rv4  = insertelement <8 x i32> %rv3 , i16 %r4 , i32 4
%rv5  = insertelement <8 x i32> %rv4 , i16 %r5 , i32 5
%rv6  = insertelement <8 x i32> %rv5 , i16 %r6 , i32 6
%rv7  = insertelement <8 x i32> %rv6 , i16 %r7 , i32 7
```


================
Comment at: llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp:2858-2869
+        // Reschedule bundle instructions, sorting them by def-to-use order
+        ScheduleData *FirstInBundle = BS.getScheduleData(FirstInsert);
+        ScheduleData *Current = BS.getScheduleData(LastInsert);
+        ScheduleData *NextInBundle = nullptr;
+        do {
+          Current->FirstInBundle = FirstInBundle;
+          Current->NextInBundle = NextInBundle;
----------------
ABataev wrote:
> anton-afanasyev wrote:
> > ABataev wrote:
> > > anton-afanasyev wrote:
> > > > ABataev wrote:
> > > > > This really looks ugly. Why you can't use the existing functionality?
> > > > Hmm, I'm agree that doesn't look elegant. Let me describe the task I'm solving here.
> > > > For the inserts tree node we have two orders, which could be actually different: index (lane) order and def-use order. Both of these orders are used: index order is for correct lane matching, def-use order is for correct vector replacing and extracting.
> > > > There could be the (theoretical, seldom in practice) case when intermediate insert instruction is used out-of-tree:
> > > > ```
> > > > %rv0  = insertelement <4 x i32> undef, i16 %r0 , i32 3
> > > > %rv1  = insertelement <4 x i32> %rv0 , i16 %r1 , i32 1
> > > > %rv2  = insertelement <4 x i32> %rv1 , i16 %r2 , i32 0
> > > > %rv3  = insertelement <4 x i32> %rv2 , i16 %r3 , i32 2
> > > > ...
> > > > %a = shufflevector <4 x i32> %rv2, <4 x i32> undef, ...
> > > > ```
> > > > So we have to use def-use order to extract only 3, 1 and 0 indices here after vectorization.
> > > > 
> > > > But your comment made me think to pass already def-use sorted inserts to `tryToVectorizeList()` and to use `TreeEntry::ReorderIndices` to store index order. I'm to try this approach.
> > > I think in this case you need to vectorize only `%rv0-%rv2` sequence, `%rv3` should remain as an insert
> > No, full-inserted vector could be used as well:
> > ```
> > ...
> > %a = shufflevector <4 x i32> %rv2, <4 x i32> undef, ...
> > %b = shufflevector <4 x i32> %rv3, <4 x i32> undef, ...
> > ```
> > So we vectorize all lanes, and then replace `%rv3` by `VectorizedValue` and replace `%rv2` by `VectorizedValue`, preceded by shuffle of `SK_ExtractVector` kind.
> I would vectorize `%rv0-%rv2` and kept `%rv3` as is, i.e. something like this:
> ```
> %vec = vector<%rv0, %rv1, %rv2>
> %rv3 = insertelement <4 x i32> %vec , i16 %r3 , i32 2
> %a = shufflevector <4 x i32> %vec, <4 x i32> undef, ...
> ```
This was just an simple example, one can imagine more complex case where whole vectorization and further extracting is rather cheap to do, so the decision is eventually made by cost. We use `getShuffleCost()` for extract cost at line 4225.


Repository:
  rG LLVM Github Monorepo

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D98714/new/

https://reviews.llvm.org/D98714



More information about the llvm-commits mailing list