[PATCH] D133441: [SLP] Look ahead for mutual horizontal reductions.

Alexandros Lamprineas via Phabricator via llvm-commits llvm-commits at lists.llvm.org
Sat Sep 10 04:49:33 PDT 2022


labrinea added a comment.

>> One solution could be to collect the seeds that resulted in failed trees due to external users (and perhaps filter them more by checking if the users are actually seeds), and revisit the seeds once again. I think this is more or less what Alexey is describing.
>
> Yep, exactly! At least looks so :)

I am not sure how I would revisit the seeds, though, without changing the code. Also, can you clarify what you mean by seeds in this context? The root instruction of the reduction, the list of reduced values passed to buildTree, something else?



================
Comment at: llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp:11035-11038
+    for (ArrayRef<Value *> RdxOps : ReductionOps)
+      for (Value *RdxOp : RdxOps)
+        if (RdxOp)
+          V.addHorizontalReduction(RdxOp);
----------------
ABataev wrote:
> labrinea wrote:
> > ABataev wrote:
> > > labrinea wrote:
> > > > ABataev wrote:
> > > > > labrinea wrote:
> > > > > > ABataev wrote:
> > > > > > > Not sure it is a good idea to store reduction ops here, you're working with reduced values actually.
> > > > > > If you look at the reproducer:
> > > > > > 
> > > > > > ```
> > > > > > for i = ...
> > > > > >   sm += x[i];
> > > > > >   sq += x[i] * x[i];
> > > > > > ```
> > > > > > the addition `sm += x[i]` (which is a reduction op of the first horizontal reduction) is an external user of the scalar load from the multiplication `x[i] * x[i]` (which is a reduction value of the second horizontal reduction)
> > > > > Aha, this is for external users. Then why do you need to store there reduced values?
> > > > Because similarly the multiplication (reduced value) is an external user of the scalar load from the other horizontal reduction.
> > > Ah, I see what you mean. But this is not the best place for gathering the reduced values; they must be gathered during an attempt to build the tree, since they might not be vectorized but gathered. If they are gathered, you cannot treat them as a future vector.
> > I see. In other words we cannot speculate that the vectorization will eventually happen before trying to actually reduce, so the look ahead approach I am suggesting is rather unsafe? By the time we are attempting to build the tree it is already too late.
> You can gather such "vectorizable" scalars during/after the SLP graph building and store them in a set for future analysis. Later, subsequent SLP graphs will check whether such scalars are potentially vectorizable and exclude the extract cost for them. Then SLP sees that some part of the basic block was vectorized and restarts, trying to vectorize other parts of the same block.
That doesn't happen with horizontal reductions. If a reduction fails, its root instruction is put in the postponed list for later processing, where it is no longer treated as a reduction. At least that's my understanding.

I could indeed cache the scalars of a vectorizable tree entry later on, maybe inside getTreeCost (where the extract cost is available, allowing a more informed decision). However, the reduction ops themselves are not part of the tree entry, so we would still need to cache them separately, perhaps via the ignore list.

Also, the horizontal reduction is iteratively modeled by several vectorization trees with different vectorization factors. Which one do I pick to cache? All of them?
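
For reference, the mutual-reduction pattern under discussion can be reproduced as a small standalone C function. This is only an illustrative sketch of the reproducer quoted above (function and variable names are made up, not from the patch):

```c
#include <stddef.h>

/* Two horizontal reductions over the same loads of x[i]: the add
 * reduction (sm) and the multiply reduction (sq) each make the other's
 * scalar load look like it has an external user, which inflates the
 * extract cost SLP computes when it considers either tree in isolation. */
void sum_and_sumsq(const float *x, size_t n, float *sm, float *sq) {
  float s = 0.0f, q = 0.0f;
  for (size_t i = 0; i < n; ++i) {
    s += x[i];        /* reduction op of the first horizontal reduction */
    q += x[i] * x[i]; /* reduced value of the second uses the same load */
  }
  *sm = s;
  *sq = q;
}
```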


================
Comment at: llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp:11778
   SmallPtrSet<Value *, 8> VisitedInstrs;
+  Optional<HorizontalReduction> PendingReduction = None;
   bool Res = false;
----------------
ABataev wrote:
> labrinea wrote:
> > ABataev wrote:
> > > labrinea wrote:
> > > > ABataev wrote:
> > > > > Why do you need this one? In case of successful reduction, the vectorizer restarts the analysis and rebuilds the reduction graph.
> > > > The idea is to separate the "matching" of a horizontal reduction (matchAssociativeReduction) from the "processing" of it (tryToReduce). That way we can postpone the processing until we have found at least one more. This allows us to identify mutual reductions and ignore the extraction cost of the common scalar values. As Vasilis mentioned, this limits the number of mutual reductions to two, which is not ideal.
> > > No need to postpone it, the pass will just repeat the same steps once again after the changes.
> > I am not following. What should the code look like then? How can we solve the problem of mutual reductions without looking ahead? PendingReduction serves as a one-element buffer, if that makes it clearer.
> Just adjust the cost, everything else should be handled by the SLP vectorizer pass.
As I explained on the other thread of comments, the unsuccessful reductions are not rescheduled the way you describe. I am still looking into how else I could make it work.


Repository:
  rG LLVM Github Monorepo

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D133441/new/

https://reviews.llvm.org/D133441
