[PATCH] D136432: [AMDGPU] Combine BFI instructions.

Mon Oct 31 01:06:53 PDT 2022

jsilvanus added inline comments.

================
Comment at: llvm/lib/Target/AMDGPU/AMDGPUISelDAGToDAG.cpp:2196-2203
+  do {
+    SDValue O0 = CurrentOr->getOperand(0);
+    SDValue O1 = CurrentOr->getOperand(1);
+    if (O1.getOpcode() != ISD::AND)
+      return false;
+
+    switch (O0.getOpcode()) {
----------------
tsymalla wrote:
> jsilvanus wrote:
> > tsymalla wrote:
> > > nhaehnle wrote:
> > > > tsymalla wrote:
> > > > > nhaehnle wrote:
> > > > > > tsymalla wrote:
> > > > > > > nhaehnle wrote:
> > > > > > > > There is still a weird asymmetry here. Do you have a reason for treating the two operands differently? If not, you should treat them the same.
> > > > > > > The reason is that no transformation will happen if one of the operands doesn't fit the scheme. This is the case if
> > > > > > > 
> > > > > > > a) On the top level, a OR(AND, AND) pattern occurs (this is handled by ISel)
> > > > > > > b) the second operand is no AND in either case (handled by the top-level O1.getOpcode() check)
> > > > > > > c) A nested OR operand is used more than once - in that case, it cannot be optimized away
> > > > > > > d) Any of the AND operands is not a constant value, e. g. doesn't contribute to the mask.
> > > > > > > e) Any of the ANDs is being used more than once - in that case, it cannot be optimized away
> > > > > > > f) If none of the assumptions of the structure can be made, just return false.
> > > > > > > 
> > > > > > > It simplifies things if the operands are handled differently. Even if, for instance, the InsertBFIOp(O1) could theoretically be hoisted out, this complicates things at the BFI generation part later, because now the base and the order at the tail of the vector are swapped, making it impossible to just assign the back (base) to `NewNode` and drop it. So, this loop  basically relies on the state updated from the previous iteration and uses CurrentOr as controlling element.
> > > > > > > b) the second operand is no AND in either case (handled by the top-level O1.getOpcode() check)
> > > > > > 
> > > > > > What's the justification for this rule? Couldn't you have `N = (x1 & c1) | ((x2 & c2) | (x3 & c3))`?
> > > > > In general, you can, but looking at existing tests, is it a) likely this will happen and b) all the other assumptions about the constants are true -  so the compiler will actually create some BFI instructions from the sequence?
> > > > > I can change the implementation to treat the operands equally, but for now, is it likely that this only includes additional checks and memory usage, but will never contribute to the generated code at this point?
> > > > You're asking the question the wrong way around.
> > > > 
> > > > The code that we are inspecting manually is a microscopically tiny fraction of the total amount of code that our compilers ever see. If there is an obvious generalization like this one that is easy to do, we should **always** take it. And when we notice it, we should add lit tests to ensure the case is covered.
> > > > 
> > > > Yes, there are additional checks for the generalization. What additional memory usage is there?
> > > > You're asking the question the wrong way around.
> > > > 
> > > > The code that we are inspecting manually is a microscopically tiny fraction of the total amount of code that our compilers ever see. If there is an obvious generalization like this one that is easy to do, we should **always** take it. And when we notice it, we should add lit tests to ensure the case is covered.
> > > > 
> > > > Yes, there are additional checks for the generalization. What additional memory usage is there?
> > > 
> > > Storing multiple times the bytes for the `SDNode *` and the mask inside the small vector for each OR having an AND as operand even if its likely that the memory will get released in a lot of cases.
> > > Sure, the overhead is neglectable, but I at least wanted to talk about that.
> > > 
> > > I'll just add the generalization in the next diff.
> > The generalization would be to accept arbitrary OR-trees of AND clauses of the form `(x_i & c_i)` or `(c_i & x_i)`?
> Yes. The generalization would be to have the constant operands be in arbitrary order for a given AND, and also to have the nested ORs be at any position for a given OR node.
You still artificially require at least one operand of each OR to be an AND. However, we can have an arbitrary OR-tree that you would need to traverse using a stack to collect it's leaves -- the AND nodes.

For example, currently you reject `N = ((x1 & c1) | (x2 & c2)) | ((x3 & c3) | (x4 & c4))` which can be mapped to three BFIs.

Repository:
  rG LLVM Github Monorepo

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D136432/new/

https://reviews.llvm.org/D136432