[PATCH] D126234: [AArch64] Add support for FMA intrinsics to shouldSinkOperands.

Florian Hahn via Phabricator via llvm-commits llvm-commits at lists.llvm.org
Thu May 26 02:45:20 PDT 2022


fhahn updated this revision to Diff 432223.
fhahn added a comment.

Thanks for taking a look!

In D126234#3536753 <https://reviews.llvm.org/D126234#3536753>, @dmgreen wrote:

> I think that non-legal types will be legalized to legal types where a splat can still be used for lane indexing.
> And it seems like the first two operands of a fma are commutative?
> And I don't think the splat index can be out of range, if we consider illegal types to be legalized.
>
> Which means we might be able to simplify it to this, maybe with a check for fullfp16 and that the scalar types are f16/f32/f64?
>
>   +    case Intrinsic::fma:
>        case Intrinsic::aarch64_neon_sqdmull:
>        case Intrinsic::aarch64_neon_sqdmulh:
>        case Intrinsic::aarch64_neon_sqrdmulh:
>          // Sink splats for index lane variants
>          if (isSplatShuffle(II->getOperand(0)))
>            Ops.push_back(&II->getOperandUse(0));
>          if (isSplatShuffle(II->getOperand(1)))
>            Ops.push_back(&II->getOperandUse(1));
>          return !Ops.empty();

That's a good point; the original checks were probably more paranoid than necessary. I updated the patch to share the common code with aarch64_neon_sqdmul*, with an element-type check for fp16.
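
For reference, roughly what the updated handling might look like -- a sketch only, the exact element-type guard in the uploaded diff may differ slightly; isSplatShuffle is the existing static helper in AArch64ISelLowering.cpp:

      case Intrinsic::fma:
        // Indexed fp16 FMLA requires fullfp16; without it the f16 vector
        // types are not legal and the lane variant is unavailable, so keep
        // the splat where it is.
        if (isa<VectorType>(I->getType()) &&
            cast<VectorType>(I->getType())->getElementType()->isHalfTy() &&
            !Subtarget->hasFullFP16())
          return false;
        LLVM_FALLTHROUGH;
      case Intrinsic::aarch64_neon_sqdmull:
      case Intrinsic::aarch64_neon_sqdmulh:
      case Intrinsic::aarch64_neon_sqrdmulh:
        // Sink splats so ISel can pick the indexed (by-lane) variants.
        if (isSplatShuffle(II->getOperand(0)))
          Ops.push_back(&II->getOperandUse(0));
        if (isSplatShuffle(II->getOperand(1)))
          Ops.push_back(&II->getOperandUse(1));
        return !Ops.empty();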

> Note that I have also heard of cases where the lane-indexed fma instructions are slightly slower than a normal fma, as they get cracked into a dup and an fma micro-op, which would make leaving them outside of the loop a better idea and might make the Machine Combiner a better place for this if it worked. But I don't know where that actually happens, and there doesn't seem to be an obvious place in-tree where the latencies differ, except for one old core and this Cyclone difference which doesn't make a lot of sense: https://godbolt.org/z/63eWsEqT1. So I think sinking them unconditionally is probably fine if it saves register pressure; we just might need to change that in the future if we find cases where it does matter.

From the publicly available ARM software optimization guides, it looks like the indexed and non-indexed variants should give the same performance on A57/A75, but the indexed variants have lower throughput on A55. A quick glance at the A55 scheduling model suggests this difference is not reflected there, though.

It also looks like the same issue would exist for the other indexed variants handled there. If issues show up, investigating more granular handling sounds like a good idea. We could also limit the sinking to cores where the indexed and non-indexed variants have the same performance, if needed.
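
If we ever need that, a minimal sketch of what such a gate could look like. hasCheapIndexedFMA() / a corresponding subtarget feature are hypothetical names used for illustration only; nothing like this exists upstream today:

      case Intrinsic::fma:
        // Hypothetical per-CPU tuning gate: only sink the splat on cores
        // where the indexed FMLA forms match the plain forms in latency
        // and throughput. hasCheapIndexedFMA() is an assumed, not an
        // existing, subtarget query.
        if (!Subtarget->hasCheapIndexedFMA())
          return false;
        // ... then fall through to the common splat-sinking code as above.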


Repository:
  rG LLVM Github Monorepo

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D126234/new/

https://reviews.llvm.org/D126234

Files:
  llvm/lib/Target/AArch64/AArch64ISelLowering.cpp
  llvm/test/Transforms/CodeGenPrepare/AArch64/sink-free-instructions.ll


