[PATCH] D125712: [SLP][X86] Improve reordering to consider alternate instruction bundles

Wed May 25 11:36:09 PDT 2022

vporpo added inline comments.

================
Comment at: llvm/test/Transforms/SLPVectorizer/X86/reorder_with_external_users.ll:121-123
+; CHECK-NEXT:    [[TMP2:%.*]] = fsub <2 x double> [[TMP1]], <double 1.100000e+00, double 1.200000e+00>
+; CHECK-NEXT:    [[TMP3:%.*]] = fadd <2 x double> [[TMP1]], <double 1.100000e+00, double 1.200000e+00>
+; CHECK-NEXT:    [[TMP4:%.*]] = shufflevector <2 x double> [[TMP2]], <2 x double> [[TMP3]], <2 x i32> <i32 0, i32 3>
----------------
ABataev wrote:
> vporpo wrote:
> > ABataev wrote:
> > > vporpo wrote:
> > > > ABataev wrote:
> > > > > vporpo wrote:
> > > > > > ABataev wrote:
> > > > > > > I don't quite understand what's the difference here. Could you explain, please?
> > > > > > Before this patch the pattern `shuffle + fadd + fsub` lowers to 3 instructions: blend + vector add + vector sub (the shuffle selects TMP3[0], TMP2[1], which is fadd[0], fsub[1], the inverse of the addsub pattern).
> > > > > > 
> > > > > > With this patch `shuffle + fadd + fsub` lowers to a single addsub instruction (the shuffle selects TMP2[0], TMP3[1], which is fsub[0], fadd[1]).
> > > > > > This saves 2 instructions, which means that during reordering we should keep track of this pattern, since reordering it can increase the overhead.
> > > > > Ok, why not fix it in the backend? This new function you added does not affect the cost, but ignoring the shuffle actually increases the cost of the tree.
> > > > I think it is better to do it here because we may end up not vectorizing some code because of the additional cost of the blend + add + sub pattern. I think I can fix the tree cost with a follow-up patch that fixes the cost of the altshuffle pattern when it corresponds to the addsub instruction.
> > > Ah, I see, thanks. What about trying to do both - lowering in the backend and the cost adjustment? Is it possible?
> > Hmm I guess in the back-end we won't have access to cost benefit analysis like the one we are doing during the reordering step (i.e., finding the most popular order). So we would have to do a simple conversion of any sub-add pattern to an add-sub + shuffles, but I am not sure that this would always be profitable. I think that the addsub pattern should be taken into consideration when looking for the most popular ordering so that it can influence the decision, but I am not so sure we can always justify the cost of the extra shuffles.
> The backend does not need the cost, it just checks the pattern and lowers the sequence to the instructions. Why does it need the cost?
I think there is a difference between doing the transformation here vs in the back-end. In the back-end we can't easily check if reordering a sub-add to an add-sub can also remove some of the shuffles that are already in the code.

For example if we have code like this:
```
%vsub = fsub <2 x double> %subop, ...
%vadd = fadd <2 x double> %addop, ...
%shuffle = shufflevector <2 x double> %vsub, <2 x double> %vadd, <2 x i32> <i32 0, i32 3>  ; vsub[0], vadd[1]
store <2 x double> %shuffle, ...
```
This will be lowered to: `add + sub + blend + store`.

But if we convert the `sub-add` pattern (i.e., `add + sub + blend`) to `add-sub + shuffles` in the back-end, then this will introduce 3 shuffles: 2 for the operands and 1 for the user, resulting in the pattern `pshuf + pshuf + addsub + pshuf + store`. This is clearly worse than the original code.
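To make the shuffle count concrete, here is a minimal lane-level sketch in Python (a model, not LLVM code). It assumes `addsub` subtracts in lane 0 and adds in lane 1, and takes `sub-add` to mean the inverse (add in lane 0, subtract in lane 1), as described earlier in this thread. All helper names are hypothetical.

```python
# Lane-level model of <2 x double> values as Python tuples (illustrative only).

shuffles = []  # records every lane swap so the shuffles can be counted


def addsub(a, b):
    """Assumed addsub semantics: subtract in lane 0, add in lane 1."""
    return (a[0] - b[0], a[1] + b[1])


def sub_add(a, b):
    """Inverse pattern: add in lane 0, subtract in lane 1 (add + sub + blend)."""
    return (a[0] + b[0], a[1] - b[1])


def swap(v, label):
    """Swap the two lanes and record that a shuffle was needed."""
    shuffles.append(label)
    return (v[1], v[0])


def sub_add_via_addsub(a, b):
    """Express sub-add as addsub plus shuffles on both operands and the result."""
    swapped = addsub(swap(a, "operand a"), swap(b, "operand b"))
    return swap(swapped, "result")


A = (1.5, 2.5)
B = (0.25, 0.5)
result = sub_add_via_addsub(A, B)
```

Both forms compute the same lanes, but the addsub form records three lane swaps in `shuffles`, one per operand and one for the result.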

But the transformation could be profitable if we could remove some of the shuffles, for example if the code looked like this:
```
%vsub = fsub <2 x double> %subop, ...
%vadd = fadd <2 x double> %addop, ...
%shuffle = shufflevector <2 x double> %vsub, <2 x double> %vadd, <2 x i32> <i32 0, i32 3>  ; vsub[0], vadd[1]
%shuffle2 = shufflevector <2 x double> %shuffle, <2 x double> zeroinitializer, <2 x i32> <i32 1, i32 0>  ; shuffle[1], shuffle[0]
store <2 x double> %shuffle2, ...
```
Converting this to add-sub in the back-end would result in `pshuf + pshuf + addsub + pshuf + pshuf + store`, but the latter two `pshuf` instructions negate each other, so they could be optimized away. A similar optimization could happen if the inputs were already reordered with a shuffle earlier.
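The cancellation can be checked with a small lane-level sketch in Python (a model, not LLVM code). It assumes `addsub` subtracts in lane 0 and adds in lane 1, and that the user shuffle simply reverses the two lanes. The function names are hypothetical.

```python
# Lane-level model of <2 x double> values as Python tuples (illustrative only).

def addsub(a, b):
    """Assumed addsub semantics: subtract in lane 0, add in lane 1."""
    return (a[0] - b[0], a[1] + b[1])


def sub_add(a, b):
    """Inverse pattern: add in lane 0, subtract in lane 1."""
    return (a[0] + b[0], a[1] - b[1])


def swap(v):
    """Reverse the two lanes (models a lane-swapping shuffle)."""
    return (v[1], v[0])


def original(a, b):
    """add + sub + blend, followed by the user's lane-reversing shuffle."""
    return swap(sub_add(a, b))


def rewritten_naive(a, b):
    """Two operand shuffles, addsub, a result shuffle, then the user's shuffle."""
    return swap(swap(addsub(swap(a), swap(b))))


def rewritten_folded(a, b):
    """Same computation once the two back-to-back result shuffles cancel."""
    return addsub(swap(a), swap(b))


A = (1.5, 2.5)
B = (0.25, 0.5)
```

In this model the folded form keeps only the two operand shuffles, which is where a rewrite could start to pay off if those operand shuffles can be removed as well.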

What I am trying to say is that simply converting a `sub-add` pattern to an `add-sub` pattern does not look profitable unless we can also get rid of some of the shuffles that cancel each other out. I can't think of how we could do this in the back-end more effectively than we could do it here, but perhaps I am missing something.

Repository:
  rG LLVM Github Monorepo

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D125712/new/

https://reviews.llvm.org/D125712