[PATCH] D98714: [SLP] Add insertelement instructions to vectorizable tree

Fri Apr 9 09:35:13 PDT 2021

ABataev added inline comments.

================
Comment at: llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp:2841-2844
+        if (!llvm::any_of(VL, [&](Value *S) { return Op == S; })) {
+          assert(!FirstInsert && "FirstInsert already found?");
+          FirstInsert = IE;
+        }
----------------
anton-afanasyev wrote:
> ABataev wrote:
> > The first insert has undef vector operand, no?
> > Plus, this has n^2 complexity
> Not necessary. There could be the case we deal with splitted array of inserts:
> ```
>   %rv0  = insertelement <8 x i32> undef, i16 %r0 , i32 0
>   %rv1  = insertelement <8 x i32> %rv0 , i16 %r1 , i32 1
>   %rv2  = insertelement <8 x i32> %rv1 , i16 %r2 , i32 2
>   %rv3  = insertelement <8 x i32> %rv2 , i16 %r3 , i32 3
>   %rv4  = insertelement <8 x i32> %rv3 , i16 %r4 , i32 4
>   %rv5  = insertelement <8 x i32> %rv4 , i16 %r5 , i32 5
>   %rv6  = insertelement <8 x i32> %rv5 , i16 %r6 , i32 6
>   %rv7  = insertelement <8 x i32> %rv6 , i16 %r7 , i32 7
> ```
> Since `<8 x i32>` doesn't fit SSE register, it is splitted to two `<4 x i32>`. The first time we process `%rv0-%rv3`, the next time -- `%rv4-%rv7`.
Hmm, looks like currently it can be only the first element.

================
Comment at: llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp:2858-2869
+        // Reschedule bundle instructions, sorting them by def-to-use order
+        ScheduleData *FirstInBundle = BS.getScheduleData(FirstInsert);
+        ScheduleData *Current = BS.getScheduleData(LastInsert);
+        ScheduleData *NextInBundle = nullptr;
+        do {
+          Current->FirstInBundle = FirstInBundle;
+          Current->NextInBundle = NextInBundle;
----------------
anton-afanasyev wrote:
> ABataev wrote:
> > This really looks ugly. Why you can't use the existing functionality?
> Hmm, I'm agree that doesn't look elegant. Let me describe the task I'm solving here.
> For the inserts tree node we have two orders, which could be actually different: index (lane) order and def-use order. Both of these orders are used: index order is for correct lane matching, def-use order is for correct vector replacing and extracting.
> There could be the (theoretical, seldom in practice) case when intermediate insert instruction is used out-of-tree:
> ```
> %rv0  = insertelement <4 x i32> undef, i16 %r0 , i32 3
> %rv1  = insertelement <4 x i32> %rv0 , i16 %r1 , i32 1
> %rv2  = insertelement <4 x i32> %rv1 , i16 %r2 , i32 0
> %rv3  = insertelement <4 x i32> %rv2 , i16 %r3 , i32 2
> ...
> %a = shufflevector <4 x i32> %rv2, <4 x i32> undef, ...
> ```
> So we have to use def-use order to extract only 3, 1 and 0 indices here after vectorization.
> 
> But your comment made me think to pass already def-use sorted inserts to `tryToVectorizeList()` and to use `TreeEntry::ReorderIndices` to store index order. I'm to try this approach.
I think in this case you need to vectorize only `%rv0-%rv2` sequence, `%rv3` should remain as an insert

Repository:
  rG LLVM Github Monorepo

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D98714/new/

https://reviews.llvm.org/D98714