[PATCH] D116270: [AMDGPU] Enable divergence-driven XNOR selection

Jay Foad via Phabricator via llvm-commits llvm-commits at lists.llvm.org
Fri Jan 21 09:47:49 PST 2022


foad added inline comments.


================
Comment at: llvm/lib/Target/AMDGPU/SIISelLowering.cpp:12491-12493
+    if (isCommutativeBinOp(N1->getOpcode()) &&
+        DAG.isConstantIntBuildVectorOrConstantInt(N1->getOperand(1)))
+      return true;
----------------
alex-t wrote:
> foad wrote:
> > alex-t wrote:
> > > foad wrote:
> > > > alex-t wrote:
> > > > > foad wrote:
> > > > > > I don't understand this heuristic. Can you give an example of when it would help?
> > > > > I could demonstrate a concrete example, but I would need to paste the DAGs here, which seems like overkill, so I will try to explain without the drawing.
> > > > > Imagine we have a sub-tree consisting of commutative arithmetic operations.
> > > > > Suppose there is a path in the tree such that each node has at least one constant operand.
> > > > > Given that, it is very likely that this sub-tree will be simplified by the combiner by applying arithmetic rules and constant folding.
> > > > > This heuristic prioritizes such constant folding over keeping the outer node uniform.
> > > > > 
> > > > > ```
> > > > >   %and = and i32 %tmp, 16711935     ; 0x00ff00ff
> > > > >   %tmp1 = and i32 %arg1, 4294967040 ; 0xffffff00
> > > > >   %tmp2 = or i32 %tmp1, -65536
> > > > >   %tmp3 = or i32 %tmp2, %and
> > > > > 
> > > > > ```
> > > > > With this heuristic, this is folded and can be selected to v_perm_b32; without it, it becomes 4 scalar operations.
> > > > I still don't see why this would be useful //in general//. I think it means we should do this reassociation:
> > > > `(op (op n00, C), (op2 n10, C2)) --> (op (op n00, (op2 n10, C2)), C)`
> > > > where op2 is commutative but not necessarily the same as op. E.g. `(x|C)|(z&C2) --> (x|(z&C2))|C`
> > > I did not say that it is useful in all cases. I said that there is a high probability that two or more constant operands in the same commutative arithmetic sub-tree will be folded by further combining steps. This is just a heuristic.
> > > And that is why I said that I don't like the combiner approach in general.
> > > As soon as we make one combining algorithm aware of divergence, we must teach all the other combining algorithms to leverage the divergence information as well. That is a huge amount of work. Without it, we likely restrict optimization opportunities, because we change just one of the peephole DAG transformations by introducing a new factor unknown to the rest of the combiner.
> > > In my particular example, this heuristic lets us get
> > > 
> > > ```
> > > 	v_mov_b32_e32 v3, 0xffff0500
> > > 	v_perm_b32 v2, s0, v2, v3
> > > ```
> > > instead of
> > > 
> > > ```
> > > 	s_and_b32 s0, s0, 0xff00
> > > 	s_or_b32 s0, s0, 0xffff0000
> > > 	v_and_b32_e32 v2, 0xff00ff, v2
> > > 	v_or_b32_e32 v2, s0, v2
> > > ```
> > > In general, I agree that this is a fragile design and we will have to add more and more exceptions to keep all the optimizations working.
> > > There are 2 clean ways:
> > > 
> > >   - Divergence-driven DAG combiner
> > >   - Post-ISel MIR optimizer
> > > 
> > > I consider the latter much simpler to implement.
> > I understand why this heuristic helps you form v_perm_b32 in **one specific case**. I don't understand why "there is a **high probability** that the 2 or more constant operands in the same commutative arithmetic sub-tree will be folded by further steps of the combining". As far as I can see, it is just luck that it helps this particular v_perm_b32 combine, and it could just as easily make another case **fail** to combine. Do you have any evidence that it increases the total number of v_perm_b32 combines across a large corpus of code?
> Yes, you are right. I cannot prove that this will help in another DAG pattern.
> This is exactly what I tried to explain from the beginning.
> The only reliable way to change the "manual" selection that was done in SIFixSGPRCopies + SIInstrInfo::moveToVALU is full divergence-driven selection plus further MIR processing.
> Any attempt to add divergence information to one separate combiner transformation will lead to such "heuristics" that fix things case by case. That happens because the combiner is a peephole optimizer designed around common arithmetic rules, without any notion of divergence. Divergence is target-specific and should not affect the DAG combiner.
> Summarizing all the above: without this hack, I have a regression.
> And I expect there will be more such hacks if we go on influencing the combiner with divergence-driven transformations. The other approach, a post-ISel MIR optimizer, looks too complicated to you.
> I have no other ideas so far.
> BTW, I am not even sure that 4 scalar arithmetic operations instead of v_perm_b32 are always really bad. We increase code size but save one VGPR. In some cases with high VGPR register pressure, that is good. In general, it is not easy to formalize in which cases a transformation based on node divergence is profitable.

I suggest ignoring the regression. And if it turns out to be important, I suggest fixing it by improving v_perm pattern recognition in SITargetLowering::performOrCombine, not by tweaking isReassocProfitable.


Repository:
  rG LLVM Github Monorepo

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D116270/new/

https://reviews.llvm.org/D116270


