[PATCH] D25020: [ARM] Fix 26% performance regression on Cortex-A9 caused by not using VMLA/VMLS

Wed Oct 12 08:43:30 PDT 2016

eastig added inline comments.

================
Comment at: lib/Target/ARM/ARMISelDAGToDAG.cpp:443
+    break;
+  }
+
----------------
rovka wrote:
> eastig wrote:
> > eastig wrote:
> > > rovka wrote:
> > > > This is now checking only that the node can be lowered to VMLx. What happened to the part checking if forwarding can be used (i.e. mac following multiply or mac) and all the other checks?
> > > I removed FMA checks because of the following reasons:
> > > # FMA is lowered either VFMA or a library call if a target does not support VFPv4.
> > > # I have not found any information about accumulator forwarding for VFMA.
> > > 
> > > I removed other checks because I could not write tests for them. Are there cases when they are false?
> > Another point is that ARMDAGToDAGISel::hasNoVMLxHazardUse is only called instructions are combined into VMLx:
> > 
> > ARMInstrInfo.td:
> > 
> > ```
> > // An 'fadd' node which checks for single non-hazardous use.
> > def fadd_mlx : PatFrag<(ops node:$lhs, node:$rhs),(fadd node:$lhs, node:$rhs),[{
> >   return hasNoVMLxHazardUse(N);
> > }]>;
> > 
> > // An 'fsub' node which checks for single non-hazardous use.
> > def fsub_mlx : PatFrag<(ops node:$lhs, node:$rhs),(fsub node:$lhs, node:$rhs),[{
> >   return hasNoVMLxHazardUse(N);
> > }]>;
> > ```
> Ok, the FMA thing sounds reasonable.
> Some of the other checks are ok to remove, but from what I understand now you're only checking that you have a FADD/FSUB with a FMUL as its operand - this will be lowered to a VMLx, but in order to care about accumulator forwarding you need another VMUL/VMLA. The way it is written now, it will return true from hasNoVMLxHazardUse without actually looking at the uses, which may be bad if the use is some other NEON fp instruction (unless I'm missing something).
I check that both operands are FMUL.
See example:

```
a = ISD::FMUL
b = ISD::FMUL c, d
...= ISD::FADD a, b
```
is transformed into
```
a = ARM::VMUL
...= ARM::VMLA a, c, d
```
Accumulator forwarding is used for 'a'.

You are right some uses can be bad. Mixing VFP and SIMD instructions is not recommended  (http://infocenter.arm.com/help/topic/com.arm.doc.ddi0409i/CHDEDCDC.html).
I don't see any performance regressions in the LNT testsuite when there is a mix of VFP and SIMD instructions without using VMLx (current behaviour) and when VMLx instructions are used (my patch). So maybe in case of VFP we don't need to check uses.
If I understand correctly the note from the page: http://infocenter.arm.com/help/index.jsp?topic=/com.arm.doc.ddi0409i/BCGDCIBA.html
SIMD VMLx can have stalls depending on uses. It is not clear from the note if VFP instructions are affected. The current implementation checks both VFP and SIMD VMLx instructions. I think the current LNT run will show performance regressions when there are forwarding of SIMD accumulator to a VMLx instruction and bad uses after it. If there are performance regressions in case of SIMD VMLx then bad uses should be more important than accumulator forwarding.

https://reviews.llvm.org/D25020