[PATCH] D75248: [Arm] Do not lower vmax/vmin to Neon instructions

Thu Feb 27 05:53:06 PST 2020

jgreenhalgh created this revision.
jgreenhalgh added reviewers: samparker, dmgreen.
Herald added subscribers: hiraditya, kristof.beyls.
Herald added a project: LLVM.

`FeatureSplatVFPToNeon` is on for Cortex-A15 and Exynos. The documentation for Cortex-A57 [1] and Cortex-A72 [2] suggests it would also be beneficial there. I put together a patch taking the obvious route towards adding  `FeatureSplatVFPToNeon`  to the features for those cores, but in looking at the root-cause of code generation that requires that pass to be enabled, came to a less intrusive solution, that looks more suitable for generic compilation.

The pass attempts to remove instances where a VFP register is written an an S register, and read as a D register (see the section Register Forwarding Hazards in the linked software optimisation guides).

By far the biggest contributor to instances of "write S, read D" is an optimisation applied to run `VMAX` through Neon; as so:

  define float @max_f32(float, float) {
      %3 = call nnan float @llvm.maxnum.f32(float %1, float %0)
      ret float %3
  }

  max_f32:
      vmov.f32        s2, s1
      vmax.f32        d0, d1, d0

Rather than propose `FeatureSplatVFPToNeon` for generic to work around this codegen, I'd instead like to ask whether we should just avoid this codegen in the first place, by disabling the lowering to Neon unless we're under `UseNEONForSinglePrecisionFP`.

Note that this patch is only applicable to Armv7-A 32-bit targets; when Armv-8-A is enabled, the single precision VMAXNM instruction can be used.

This patch implements that for 32-bit floats, but leaves 16-bit floats alone - they exist after Armv8.2-A, which none of Cortex-A15, Cortex-A57 or Cortex-A72 implement.

I've validated that this gives performance improvements on Cortex-A57 and Cortex-A72 similar to that you get by turning on `FeatureSplatVFPToNeon`, and also validated this change against Cortex-A53, where I saw no performance difference. Across a larger range of benchmarks performance came out even on Cortex-A76, with one >5% regression.

If this looks Ok, I'd appreciate someone applying it for me, as I have no commit rights.

---

[1]: http://infocenter.arm.com/help/topic/com.arm.doc.uan0015b/Cortex_A57_Software_Optimization_Guide_external.pdf
[2]: https://static.docs.arm.com/uan0016/a/cortex_a72_software_optimization_guide_external.pdf

Repository:
  rG LLVM Github Monorepo

https://reviews.llvm.org/D75248

Files:
  llvm/lib/Target/ARM/ARMISelLowering.cpp
  llvm/test/CodeGen/ARM/fp16-promote.ll
  llvm/test/CodeGen/ARM/lower-vmax.ll
  llvm/test/CodeGen/ARM/vminmax.ll

-------------- next part --------------
A non-text attachment was scrubbed...
Name: D75248.246923.patch
Type: text/x-patch
Size: 3717 bytes
Desc: not available
URL: <http://lists.llvm.org/pipermail/llvm-commits/attachments/20200227/d6cb8a92/attachment.bin>