[llvm] [AArch64] Create set.fpmr intrinsic and assembly lowering (PR #114248)

Mon Nov 4 00:07:41 PST 2024

davemgreen wrote:

> > Can you explain how these are intended to be used? I am maybe a bit out of touch but I was under the impression that the FP8 intrinsics had fpmr operands to make data-flow analysis possible. Does a stand-alone intrinsic not go against that?
> 
> Do you mean that the compiler would have a pass to improve the writes into FPMR? ATM this pass/data-flow analysis does not exist, and we have not tried LICM.
> 
> For this implementation the compiler would generate a write to FPMR for every FP8 intrinsic. For instance, the example I showed, `svfloat16x2_t svcvt1_f16[_mf8]_x2_fpm(svmfloat8_t zn, fpm_t fpm)`, will lower to:
> `llvm.set.fpmr(i64 fpm)`
> `{<vscale x 8 x half>,<vscale x 8 x half>}llvm.aarch64.scvt2.nxv8i16(<vscale x 8 x i8> %zn)`
> One of the reasons to have an LLVM IR intrinsic for FPMR, instead of lowering to a machine instruction late in the pipeline, is that the compiler could hoist llvm.set.fpmr outside a (vectorized) loop if the value is constant or does not change. AFAIU hoisting a machine instruction outside a loop is more complicated than hoisting an LLVM IR intrinsic.

Hi. Yeah, sorry, there were two issues I had.
 - The first is that this lowers to read + branch + set, as opposed to just set. This feels like a premature optimization to me, considering this is very new and we don't know whether it will be a sensible optimization or not once there is real hardware. My guess would be that it often hurts more than it helps, so unless you have a very strong reason to add it you might consider simplifying it to a simple write, like other operations. We can always add it back in later if needed.
 - The other was a bit of a drive-by comment about the dataflow between the llvm.set.fpmr and the use. I agree that having a separate llvm.set.fpmr is a really sensible idea: llvm can then optimize its placement and CSE it. We probably don't want to leave that until very late and rely on machine IR optimizations. But throwing away which fpmr value is acting on the current intrinsic throws away all the info about the types it is using and how the instruction operates. It essentially turns any optimization that you might want to do in the mid-end into a (function-)global optimization, as opposed to keeping it local: it would need to search back through the program to find what the current fpmr might be. It feels like keeping that intact would be useful, with something like
`%fpmr = i64 llvm.set.fpmr(i64 fpm)`
`{<vscale x 8 x half>,<vscale x 8 x half>}llvm.aarch64.scvt2.nxv8i16(<vscale x 8 x i8> %zn, i64 %fpmr)`
Whether they are kept using inaccessible memory I'm not sure; it might be useful to keep. It's ugly and prevents a lot of optimizations, but if it prevents overlapping ranges of fpmr then maybe that is a good thing. We would usually use it for things where we don't have any other alternative because we have no data-flow information, but there is only a single register in the hardware. It should be easy to remove later if we find we can; adding a new %fpmr parameter later is more difficult.
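
To make the explicit-operand idea a bit more concrete, here is a rough IR sketch of a loop where the set sits in the preheader and the use names the value it depends on. `@set_fpmr` and `@cvt_fp8` are hypothetical stand-ins for the intrinsics discussed above (not real intrinsic names), and the function signature and loop shape are assumptions made only for illustration.

```llvm
; Hypothetical stand-ins for the set and convert intrinsics in this thread.
declare i64 @set_fpmr(i64)
declare { <vscale x 8 x half>, <vscale x 8 x half> } @cvt_fp8(<vscale x 8 x i8>, i64)

define void @convert_n_times(i64 %fpm, <vscale x 8 x i8> %zn, i64 %n) {
entry:
  ; %fpm is loop-invariant, so the set can live once in the preheader.
  %fpmr = call i64 @set_fpmr(i64 %fpm)
  br label %loop

loop:
  %i = phi i64 [ 0, %entry ], [ %i.next, %loop ]
  ; The use carries %fpmr as an operand, so the mode it runs under is
  ; visible locally without searching back for the last set.
  %r = call { <vscale x 8 x half>, <vscale x 8 x half> } @cvt_fp8(<vscale x 8 x i8> %zn, i64 %fpmr)
  %i.next = add nuw i64 %i, 1
  %done = icmp eq i64 %i.next, %n
  br i1 %done, label %exit, label %loop

exit:
  ret void
}
```

With the operand in place, a mid-end pass can read the fpmr value for a given conversion straight off the call. Without it, the same question becomes a search over every earlier set in the function.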

https://github.com/llvm/llvm-project/pull/114248