[PATCH] D25020: [ARM] Fix 26% performance regression on Cortex-A9 caused by not using VMLA/VMLS

Fri Oct 14 04:54:51 PDT 2016

rovka added a comment.

In https://reviews.llvm.org/D25020#569200, @eastig wrote:

> Hi Diana,
>
> Based on the results of the LNT runs I think the code checking accumulator forwarding is not needed at all.

Which code? You're checking accumulator forwarding both in ISel and in MLxExpansion.

> If I am correct only SIMD VMLx instructions can have issues. So we should check only them.
>  My thoughts:
> 
> - We can have features: HasSIMDVMLxHazards and HasVFPVMLxHazards. HasVMLxHazards can be built on them.

The VMLxHazards feature is enabled for Cortex-A7, A8, A9 and Swift. I think we need a better picture of the differences in behaviour between them before we rush to create more features.

> - The current checks in MLxExpansion and in ARMDAGToDAGISel::canUseVMLxForwarding are used only for SIMD instructions.

Regarding the ISel changes: those will be enabled for Cortex-A7, A8 and A9 (the intersection between HasVMLxHazards and HasVMLxForwarding). I'm a bit wary of making those changes without more benchmarking on A7 and A8 at least.

Thanks,
Diana

================
Comment at: lib/Target/ARM/ARMISelDAGToDAG.cpp:443
+    break;
+  }
+
----------------
eastig wrote:
> rovka wrote:
> > eastig wrote:
> > > eastig wrote:
> > > > rovka wrote:
> > > > > This is now checking only that the node can be lowered to VMLx. What happened to the part checking if forwarding can be used (i.e. mac following multiply or mac) and all the other checks?
> > > > I removed FMA checks because of the following reasons:
> > > > # FMA is lowered either VFMA or a library call if a target does not support VFPv4.
> > > > # I have not found any information about accumulator forwarding for VFMA.
> > > > 
> > > > I removed other checks because I could not write tests for them. Are there cases when they are false?
> > > Another point is that ARMDAGToDAGISel::hasNoVMLxHazardUse is only called instructions are combined into VMLx:
> > > 
> > > ARMInstrInfo.td:
> > > 
> > > ```
> > > // An 'fadd' node which checks for single non-hazardous use.
> > > def fadd_mlx : PatFrag<(ops node:$lhs, node:$rhs),(fadd node:$lhs, node:$rhs),[{
> > >   return hasNoVMLxHazardUse(N);
> > > }]>;
> > > 
> > > // An 'fsub' node which checks for single non-hazardous use.
> > > def fsub_mlx : PatFrag<(ops node:$lhs, node:$rhs),(fsub node:$lhs, node:$rhs),[{
> > >   return hasNoVMLxHazardUse(N);
> > > }]>;
> > > ```
> > Ok, the FMA thing sounds reasonable.
> > Some of the other checks are ok to remove, but from what I understand now you're only checking that you have a FADD/FSUB with a FMUL as its operand - this will be lowered to a VMLx, but in order to care about accumulator forwarding you need another VMUL/VMLA. The way it is written now, it will return true from hasNoVMLxHazardUse without actually looking at the uses, which may be bad if the use is some other NEON fp instruction (unless I'm missing something).
> I check that both operands are FMUL.
> See example:
> 
> ```
> a = ISD::FMUL
> b = ISD::FMUL c, d
> ...= ISD::FADD a, b
> ```
> is transformed into
> ```
> a = ARM::VMUL
> ...= ARM::VMLA a, c, d
> ```
> Accumulator forwarding is used for 'a'.
> 
> You are right some uses can be bad. Mixing VFP and SIMD instructions is not recommended  (http://infocenter.arm.com/help/topic/com.arm.doc.ddi0409i/CHDEDCDC.html).
> I don't see any performance regressions in the LNT testsuite when there is a mix of VFP and SIMD instructions without using VMLx (current behaviour) and when VMLx instructions are used (my patch). So maybe in case of VFP we don't need to check uses.
> If I understand correctly the note from the page: http://infocenter.arm.com/help/index.jsp?topic=/com.arm.doc.ddi0409i/BCGDCIBA.html
> SIMD VMLx can have stalls depending on uses. It is not clear from the note if VFP instructions are affected. The current implementation checks both VFP and SIMD VMLx instructions. I think the current LNT run will show performance regressions when there are forwarding of SIMD accumulator to a VMLx instruction and bad uses after it. If there are performance regressions in case of SIMD VMLx then bad uses should be more important than accumulator forwarding.
> 
Oops, sorry, I read that as an || instead of &&.
Anyway, you're missing the vmla - vmla case, where the fadd should have a fmul and a fadd (and this fadd should have a fmul operand itself).

https://reviews.llvm.org/D25020