[PATCH] [AArch64] Add v8.1a "Rounding Doubling Multiply Add/Subtract" extension
James Molloy
james.molloy at arm.com
Fri Mar 27 07:48:04 PDT 2015
Hi Vladimir,
I've checked out your patch and fiddled around with it. It is possible, but ugly, to match your unmatchable pattern.
First, we need to properly legalize the intrinsic. It has type i16 (and takes i16 arguments). i16 is illegal, so it needs to be promoted, but the generic code can't promote it for us, so we need to do it ourselves. There are two ways to do this: either create a new AArch64ISD node for this operation, or operate on ISD::INTRINSIC_WO_CHAIN nodes themselves. For simplicity I've done the latter.
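For context, here is a scalar sketch of what the two intrinsics compute, assuming the usual ARM definitions of rounding-doubling-multiply-high and saturating add (the function names are mine, for illustration only):

```cpp
#include <cstdint>

// Scalar reference model of the i16 intrinsics involved (a sketch, not
// the compiler's code).
// sqrdmulh: double the product, round, take the high 16 bits, saturate.
int16_t sqrdmulh_i16(int16_t a, int16_t b) {
  int64_t product = 2 * (int64_t)a * (int64_t)b + (1 << 15); // double + round
  int64_t high = product >> 16;                              // high half
  if (high > INT16_MAX) return INT16_MAX;                    // saturate
  if (high < INT16_MIN) return INT16_MIN;
  return (int16_t)high;
}

// sqadd: signed saturating add.
int16_t sqadd_i16(int16_t a, int16_t b) {
  int32_t sum = (int32_t)a + (int32_t)b;
  if (sum > INT16_MAX) return INT16_MAX;
  if (sum < INT16_MIN) return INT16_MIN;
  return (int16_t)sum;
}
```

The only case where the rounding-doubling multiply itself saturates is `-32768 * -32768`, which is why the operation is "saturating" at all.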
To start, we need to tell the target-agnostic gubbins that we want to custom-lower intrinsic nodes:
// Somewhere near AArch64ISelLowering.cpp:120
setOperationAction(ISD::INTRINSIC_WO_CHAIN, MVT::i16, Custom);
Now we need to implement the custom lowering.
// In AArch64ISelLowering.cpp, function ReplaceNodeResults()
case ISD::INTRINSIC_WO_CHAIN: {
  auto ID = getIntrinsicID(N);
  if ((ID == Intrinsic::aarch64_neon_sqrdmulh ||
       ID == Intrinsic::aarch64_neon_sqadd) &&
      N->getValueType(0) == MVT::i16) {
    // Promote to i32.
    SDLoc DL(N);
    auto Op0 = DAG.getNode(ISD::ANY_EXTEND, DL, MVT::i32, N->getOperand(1));
    auto Op1 = DAG.getNode(ISD::ANY_EXTEND, DL, MVT::i32, N->getOperand(2));
    auto NN = DAG.getNode(ISD::INTRINSIC_WO_CHAIN, DL, MVT::i32,
                          DAG.getConstant(ID, MVT::i32), Op0, Op1);
    NN = DAG.getNode(ISD::TRUNCATE, DL, MVT::i16, NN);
    Results.push_back(NN);
  }
  return;
}
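In plain terms, that lowering widens the operands, runs the operation at i32, and truncates the result back to i16. A hypothetical scalar stand-in for those DAG nodes (using sign-extension, which is one legal realisation of ANY_EXTEND, since ANY_EXTEND leaves the high bits undefined):

```cpp
#include <cstdint>

// Scalar picture of the promote-then-truncate legalization above: the
// implicit int16_t -> int32_t conversions play the role of ANY_EXTEND,
// op() is the ISD::INTRINSIC_WO_CHAIN at MVT::i32, and the final cast
// is the ISD::TRUNCATE back to MVT::i16.
int16_t promote_then_truncate(int16_t a, int16_t b,
                              int32_t (*op)(int32_t, int32_t)) {
  int32_t wide = op(a, b);
  return (int16_t)wide;
}
```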
With this change, we can get code that at least doesn't crash:
umov w8, v0.h[3]
fmov s0, w0
fmov s1, w1
fmov s2, w8
sqrdmlah s0, s1, s2
fmov w0, s0
ret
That uses the i32 variant of the sqrdmlah instruction. We need to do at least this much, I think, because we can't have intrinsics that just crash the compiler.
Now, matching the pattern. The pattern we need to match is basically the same as the i32_indexed version of the pattern, but with a `v8i16` instead of `v4i32` type:
def : Pat<(i32 (Accum (i32 FPR32Op:$Rd),
                      (i32 (int_aarch64_neon_sqrdmulh
                                (i32 FPR32Op:$Rn),
                                (i32 (vector_extract (v8i16 V128:$Rm),
                                                     VectorIndexH:$idx)))))),
But the pattern to generate is uglier still. This is the best I've got:
          (COPY_TO_REGCLASS
            (f32 (INSERT_SUBREG (IMPLICIT_DEF),
                   (!cast<Instruction>(NAME#"i16_indexed")
                     (EXTRACT_SUBREG (i32 (COPY_TO_REGCLASS FPR32Op:$Rd, FPR32)), hsub),
                     (EXTRACT_SUBREG (i32 (COPY_TO_REGCLASS FPR32Op:$Rn, FPR32)), hsub),
                     V128:$Rm, VectorIndexH:$idx),
                   hsub)), FPR32)>;
All the operands are going to be i32 types, so we need to make sure they're in the FPR32 register bank before we try to take the "hsub" subregister from them; that's what the `COPY_TO_REGCLASS` nodes are for. We then end up with an `f32` result, and in order to merge it into the `i32` the pattern must return, I've added another `COPY_TO_REGCLASS` so that the return value of the entire pattern is merely in `FPR32` (to which both i32 and f32 can be allocated).
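The subregister dance relies on the AArch64 FP register layout, where Hn occupies bits [15:0] of Sn. Conceptually, EXTRACT_SUBREG with hsub is just a bit-slice, and INSERT_SUBREG into an IMPLICIT_DEF writes that slice back (a sketch of the layout, assuming that nesting; these helpers are mine, not LLVM code):

```cpp
#include <cstdint>

// Hn is the low 16 bits of Sn: reading the hsub subregister of a 32-bit
// FPR value is a pure bit-slice.
uint16_t read_hsub(uint32_t sreg) { return (uint16_t)(sreg & 0xFFFFu); }

// INSERT_SUBREG writes the 16-bit value into bits [15:0]; with an
// IMPLICIT_DEF source the upper bits are simply undefined.
uint32_t insert_hsub(uint32_t sreg, uint16_t hval) {
  return (sreg & 0xFFFF0000u) | hval;
}
```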
This produces:
fmov s1, w1
fmov s2, w0
sqrdmlah h2, h1, v0.h[3]
fmov w0, s2
ret
Which is what we want. It can also produce chained `sqrdmlah`s, such as:
fmov s1, w1
fmov s2, w0
sqrdmlah h2, h1, v0.h[3]
sqrdmlah h2, h1, v0.h[2]
fmov w0, s2
ret
So I think this is certainly a valid way of implementing those intrinsics.
I'm not sure of the best way forward here - @Tim, would you mind checking my tomfoolery above and seeing if you agree or not? If so, implementing these is quite involved, so it would possibly be better done in a separate patch.
Cheers,
James
REPOSITORY
rL LLVM
http://reviews.llvm.org/D8502