[PATCH] D132837: [ISel] Enable generating more fma instructions.

Thomas Symalla via Phabricator via llvm-commits llvm-commits at lists.llvm.org
Fri Sep 16 20:48:26 PDT 2022


tsymalla added inline comments.


================
Comment at: llvm/lib/CodeGen/SelectionDAG/DAGCombiner.cpp:14163-14164
   // This requires reassociation because it changes the order of operations.
-  SDValue FMA, E;
-  if (CanReassociate && isFusedOp(N0) &&
-      N0.getOperand(2).getOpcode() == ISD::FMUL && N0.hasOneUse() &&
-      N0.getOperand(2).hasOneUse()) {
-    FMA = N0;
-    E = N1;
-  } else if (CanReassociate && isFusedOp(N1) &&
-             N1.getOperand(2).getOpcode() == ISD::FMUL && N1.hasOneUse() &&
-             N1.getOperand(2).hasOneUse()) {
-    FMA = N1;
-    E = N0;
-  }
-  if (FMA && E) {
-    SDValue A = FMA.getOperand(0);
-    SDValue B = FMA.getOperand(1);
-    SDValue C = FMA.getOperand(2).getOperand(0);
-    SDValue D = FMA.getOperand(2).getOperand(1);
-    SDValue CDE = DAG.getNode(PreferredFusedOpcode, SL, VT, C, D, E);
-    return DAG.getNode(PreferredFusedOpcode, SL, VT, A, B, CDE);
+  // Moving the outermost FMA operands to the innermost FMA in the chain can
+  // help with eliminating a final copy to an output register.
+  if (CanReassociate) {
----------------
foad wrote:
> tsymalla wrote:
> > foad wrote:
> > > Can you explain this? It is not at all obvious. Are you sure it's not just random? Perhaps it helps your case, but could harm other cases?
> > When keeping the order 
> > 
> > 
> > ```
> > fadd (fma A, B, (fmul C, D)), E --> fma A, B, (fma C, D, E),
> > 
> > ```
> > The compiler will assign the result of the inner FMA to a new (virtual) register which will be used by the outer FMA.
> > The outer FMA will write its output to the same virtual register which will, in the simple example (fmac_sequence_simple) cause a COPY to the first available register at the end.
> > Finally, the register allocator will try recoloring the registers. 
> > This output register (in this case, %6) will be marked as recolorizable and assigned to the first available register. As the outer FMA is using %0 and % 1 (a and b), it can only assign to %2, which is in this case v2.
> > During virtual register rewriting, the compiler will now try to eliminate identity copies. Even if the final copy is regarded as candidate for deletion, it cannot do so because the COPY is used by the final SI_RETURN. So, this generates a superfluous V_MOV at the end.
> > 
> > By changing the operand order, we get following changes:
> > The multiplicand operand order of the innermost and the outermost FMA is swapped. So, innermost uses %0 and %1 and outermost uses %2 and %3. So, the compiler is able to assign %6 (output of outermost FMA) to $vgpr0 because the register is not used inside the outermost FMA (only in the innermost FMA).
> > By this, the final COPY can be eliminated because it's essentially a identity copy, removing the final V_MOV.
> > 
> > Changing the instruction order in the DAG basically frees up the desired output register for the register allocator.
> > 
> > I cannot assume this causes harm for other cases, but from fast-math point of view it should not cause any issues. Looking at the (currently failing) test cases, I don't see any actual issues, but please correct me if I'm wrong.
> https://reviews.llvm.org/differential/diff/460688/ adds a couple of variations of PPC test cases. All I have done is swapped the operands of the "fadd %F, %G" instruction. If you apply your patch on top of this you will see that the codegen gets worse for these cases, in exactly the same way that it got better for the existing tests. So I do not agree that changing the order of the fmas is a good thing in general.
> 
> Because of this, I would prefer to keep the existing order of the fmas. That makes the combine simpler because you only have to mutate the fmul to an fma, and remove the fadd. You don't have to change the existing fmas at all.
Agreed. Thanks for constructing the tests.
However, in this example, the number of instructions (fma_innermost test) will not decrease due to the addition of the v_mov instead of constructing the return value in-place.
I don't think this will negatively affect real-life examples (but I'll check), but for the sake of code quality, we could try getting rid of these moves probably at some other place.


Repository:
  rG LLVM Github Monorepo

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D132837/new/

https://reviews.llvm.org/D132837



More information about the llvm-commits mailing list