[PATCH] D116270: [AMDGPU] Enable divergence-driven XNOR selection

Fri Jan 21 03:40:05 PST 2022

alex-t added a comment.

Initial DAG
F21741482: image.png <https://reviews.llvm.org/F21741482>
DAG after the transformation and constant folding
F21741506: image.png <https://reviews.llvm.org/F21741506>
This can be selected to v_perm_b32

================
Comment at: llvm/lib/Target/AMDGPU/SIISelLowering.cpp:12491-12493
+    if (isCommutativeBinOp(N1->getOpcode()) &&
+        DAG.isConstantIntBuildVectorOrConstantInt(N1->getOperand(1)))
+      return true;
----------------
foad wrote:
> alex-t wrote:
> > foad wrote:
> > > I don't understand this heuristic. Can you give an example of when it would help?
> > I could just demonstrate the concrete example but I would need to paste the DAGs here that look like overkill.  So, I try to explain w/o the drawing. 
> > Let's imagine we have a sub-tree constituting the commutative arithmetic operations.
> > Let us have a path in the tree such that each node has at least one operant constant.
> > Given that it is very likely that this sub-tree is going to be simplified by the combiner by application arithmetic rules and constant folding.
> > This heuristic states the priority of such constant folding over keeping the outer node uniform.
> > 
> > ```
> >   %and = and i32 %tmp, 16711935     ; 0x00ff00ff
> >   %tmp1 = and i32 %arg1, 4294967040 ; 0xffffff00
> >   %tmp2 = or i32 %tmp1, -65536
> >   %tmp3 = or i32 %tmp2, %and
> > 
> > ```
> > This is folded and can be selected to v_perm_b32 with this heuristic but will be 4 scalar operations w/o it.
> I still don't see why this would be useful //in general//. I think it means we should do this reassociation:
> `(op (op n00, C), (op2 n10, C2)) --> (op (op n00, (op2 n10, C2)), C)`
> where op2 is commutative but not necessarily the same as op. E.g. `(x|C)|(z&C2) --> (x|(z&C2))|C`
I did not say that it is useful in all cases. I said that there is a high probability that the 2 or more constant operands in the same commutative arithmetic sub-tree will be folded by further steps of the combining. This is just a heuristic.
And that is why I said that I don't like the combiner approach in general.
As soon as we influence the divergence in the one combining algorithm, we must teach all other combining algorithms to leverage the divergence information. This is a huge amount of work. Without this, we likely restrict the optimization opportunity just because we change just one of the peephole DAG transformations by introducing the new factor unknown to the rest of the combiner.
In my particular example, this heuristic let us get

```
	v_mov_b32_e32 v3, 0xffff0500
	v_perm_b32 v2, s0, v2, v3
```
instead of

```	s_and_b32 s0, s0, 0xff00
	s_or_b32 s0, s0, 0xffff0000
	v_and_b32_e32 v2, 0xff00ff, v2
	v_or_b32_e32 v2, s0, v2
```
In general, I agree that this is a fragile design and we will have to add more and more exemptions to keep all the optimizations working.
There are 2 clean ways:

  - Divergence-driven DAG combiner

  - Post ISel MIR optimizer

I consider the last one is much simpler to implement.

Repository:
  rG LLVM Github Monorepo

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D116270/new/

https://reviews.llvm.org/D116270