[PATCH] D150969: [AArch64] Try to convert two XTN and two SMLSL to UZP1, SMLSL and SMLSL2
JinGu Kang via Phabricator via llvm-commits
llvm-commits at lists.llvm.org
Mon May 22 04:19:25 PDT 2023
jaykang10 added a comment.
@efriedma Thanks for your kind comment.
> foo() is your original testcase; foo2() is modified to use intrinsics that more closely match the expected sequence, foo3 is modified to get rid of the redundant vcombine/vget pair. clang and gcc generate essentially the same code for foo2() and foo3(); somehow the way foo() is written tickles some combine in gcc that makes it treat it like foo2 instead of foo3.
Yep, I agree with you.
I have already told the team it would be better to use the `vuzp1q_s16` intrinsic directly in the `foo` example rather than expecting the compiler to optimize it... but the team wants llvm to handle the example the way gcc does, in addition to using `vuzp1q_s16`...
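For reference, below is a rough sketch of what I suggested to the team (it is not part of this patch; the function name is made up and I am assuming `foo` has the same signature as the `foo4` example quoted below, plus little-endian lane order so that `uzp1` of the reinterpreted vectors picks the low 16 bits of every 32-bit lane):

#include <arm_neon.h>

// Hypothetical rewrite using vuzp1q_s16 directly; signature assumed to match foo4.
void foo_uzp1(int16x8_t a, int32x4_t acc, int32x4_t *out, const int32_t *p) {
  // On little-endian, uzp1 keeps the even 16-bit lanes, i.e. the low half of
  // each 32-bit lane, which matches vcombine_s16(vmovn_s32(..), vmovn_s32(..)).
  int16x8_t b = vuzp1q_s16(vreinterpretq_s16_s32(vld1q_s32(&p[0])),
                           vreinterpretq_s16_s32(vld1q_s32(&p[4])));
  acc = vmlsl_s16(acc, vget_low_s16(a), vget_low_s16(b)); // smlsl
  acc = vmlsl_high_s16(acc, a, b);                        // smlsl2
  *out = acc;
}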
> It looks like your patch fixes the code for both foo2 and foo3; is that right?
The patch was written to fix `foo`, but it looks like `foo3` is also affected by it because `foo3` generates the same MIR sequence `xtn + xtn + smlsl + smlsl`.
> Can we generalize this to optimize the following? Maybe split the transform into two steps: one to optimize the following, then one to optimize any remaining extra instructions?
>
> void foo4(int16x8_t a, int32x4_t acc, int32x4_t *out, const int32_t *p) {
>   int16x8_t b = vcombine_s16(vmovn_s32(vld1q_s32(&p[0])),
>                              vmovn_s32(vld1q_s32(&p[4])));
>   acc = vmlsl_high_s16(acc, a, b);
>   *out = acc;
> }
Um... the LLVM IR for `foo4` before and after inlining is as below.
Before inlining
define dso_local void @foo4(<8 x i16> noundef %0, <4 x i32> noundef %1, ptr noundef %2, ptr noundef %3) #0 {
%5 = load <4 x i32>, ptr %3, align 4
%6 = call <4 x i16> @vmovn_s32(<4 x i32> noundef %5)
%7 = getelementptr inbounds i32, ptr %3, i64 4
%8 = load <4 x i32>, ptr %7, align 4
%9 = call <4 x i16> @vmovn_s32(<4 x i32> noundef %8)
%10 = call <8 x i16> @vcombine_s16(<4 x i16> noundef %6, <4 x i16> noundef %9)
%11 = call <4 x i32> @vmlsl_high_s16(<4 x i32> noundef %1, <8 x i16> noundef %0, <8 x i16> noundef %10)
store <4 x i32> %11, ptr %2, align 16, !tbaa !6
ret void
}
After inlining
define dso_local void @foo4(<8 x i16> noundef %0, <4 x i32> noundef %1, ptr noundef %2, ptr noundef %3) local_unnamed_addr #0 {
%5 = load <4 x i32>, ptr %3, align 4
%6 = trunc <4 x i32> %5 to <4 x i16>
%7 = getelementptr inbounds i32, ptr %3, i64 4
%8 = load <4 x i32>, ptr %7, align 4
%9 = trunc <4 x i32> %8 to <4 x i16>
%10 = shufflevector <4 x i16> %6, <4 x i16> %9, <8 x i32> <i32 0, i32 1, i32 2, i32 3, i32 4, i32 5, i32 6, i32 7>
%11 = shufflevector <8 x i16> %0, <8 x i16> poison, <4 x i32> <i32 4, i32 5, i32 6, i32 7>
%12 = call <4 x i32> @llvm.aarch64.neon.smull.v4i32(<4 x i16> %11, <4 x i16> %9)
%13 = sub <4 x i32> %1, %12
store <4 x i32> %13, ptr %2, align 16, !tbaa !6
ret void
}
As you can see, after inlining the `%10 = shufflevector` is dead, so it is eventually removed and we end up with the IR below.
define dso_local void @foo4(<8 x i16> noundef %0, <4 x i32> noundef %1, ptr nocapture noundef writeonly %2, ptr nocapture noundef readonly %3) local_unnamed_addr #0 {
%5 = getelementptr inbounds i32, ptr %3, i64 4
%6 = load <4 x i32>, ptr %5, align 4
%7 = trunc <4 x i32> %6 to <4 x i16>
%8 = shufflevector <8 x i16> %0, <8 x i16> poison, <4 x i32> <i32 4, i32 5, i32 6, i32 7>
%9 = tail call <4 x i32> @llvm.aarch64.neon.smull.v4i32(<4 x i16> %8, <4 x i16> %7)
%10 = sub <4 x i32> %1, %9
store <4 x i32> %10, ptr %2, align 16, !tbaa !6
ret void
}
In my personal opinion, it is hard to generate `uzp1` from the above LLVM IR snippet: once the dead shufflevector is gone, only a single truncate survives, so there is no longer a pair of narrowing operations to fold into a `uzp1`. The legalized DAG is as below.
Legalized selection DAG: %bb.0 'foo4:entry'
SelectionDAG has 20 nodes:
t0: ch,glue = EntryToken
t8: i64,ch = CopyFromReg t0, Register:i64 %3
t10: i64 = add nuw t8, Constant:i64<16>
t13: v4i32,ch = load<(load (s128) from %ir.arrayidx2, align 4)> t0, t10, undef:i64
t4: v4i32,ch = CopyFromReg t0, Register:v4i32 %1
t2: v8i16,ch = CopyFromReg t0, Register:v8i16 %0
t17: v4i16 = extract_subvector t2, Constant:i64<4>
t14: v4i16 = truncate t13
t24: v4i32 = AArch64ISD::SMULL t17, t14
t21: v4i32 = sub t4, t24
t6: i64,ch = CopyFromReg t0, Register:i64 %2
t22: ch = store<(store (s128) into %ir.out, !tbaa !6)> t13:1, t21, t6, undef:i64
t23: ch = AArch64ISD::RET_GLUE t22
> Can we generalize this to handle other widening instructions that use the high half of the inputs?
I think so.
The main issue is generating the `uzp1`. The `smlsl` acts as an anchor node for detecting the code sequence that can use `uzp1`, so I think we could cover more similar cases.
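For example, I imagine the peephole could key off a small mapping from each low-half widening opcode to its high-half "2" form, so the same XTN + XTN rewrite would also cover smlal/umlsl/umlal and friends. This is just a sketch of the idea, not code from the patch; the opcode names below are what I assume the relevant AArch64 instruction definitions are called:

// Sketch only: map a low-half widening multiply(-accumulate) opcode to its
// high-half "2" variant. The opcode names follow the backend's
// <op>v4i16_v4i32 / <op>v8i16_v4i32 naming scheme and are assumptions here.
static unsigned getHighHalfOpcode(unsigned LowOpc) {
  switch (LowOpc) {
  case AArch64::SMLSLv4i16_v4i32: return AArch64::SMLSLv8i16_v4i32; // smlsl2
  case AArch64::SMLALv4i16_v4i32: return AArch64::SMLALv8i16_v4i32; // smlal2
  case AArch64::UMLSLv4i16_v4i32: return AArch64::UMLSLv8i16_v4i32; // umlsl2
  case AArch64::UMLALv4i16_v4i32: return AArch64::UMLALv8i16_v4i32; // umlal2
  case AArch64::SMULLv4i16_v4i32: return AArch64::SMULLv8i16_v4i32; // smull2
  case AArch64::UMULLv4i16_v4i32: return AArch64::UMULLv8i16_v4i32; // umull2
  default: return 0; // not a sequence we handle
  }
}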
> Any thoughts on a DAGCombine vs. MIPeepholeOpt?
`foo`'s legalized DAG is as below.
Legalized selection DAG: %bb.0 'foo:entry'
SelectionDAG has 27 nodes:
t0: ch,glue = EntryToken
t2: v8i16,ch = CopyFromReg t0, Register:v8i16 %0
t8: i64,ch = CopyFromReg t0, Register:i64 %3
t11: v4i32,ch = load<(load (s128) from %ir.p, align 4)> t0, t8, undef:i64
t14: i64 = add nuw t8, Constant:i64<16>
t15: v4i32,ch = load<(load (s128) from %ir.arrayidx2, align 4)> t0, t14, undef:i64
t27: ch = TokenFactor t11:1, t15:1
t4: v4i32,ch = CopyFromReg t0, Register:v4i32 %1
t18: v4i16 = extract_subvector t2, Constant:i64<0>
t12: v4i16 = truncate t11
t31: v4i32 = AArch64ISD::SMULL t18, t12
t23: v4i16 = extract_subvector t2, Constant:i64<4>
t16: v4i16 = truncate t15
t30: v4i32 = AArch64ISD::SMULL t23, t16
t25: v4i32 = add t31, t30
t26: v4i32 = sub t4, t25
t6: i64,ch = CopyFromReg t0, Register:i64 %2
t28: ch = store<(store (s128) into %ir.out, !tbaa !6)> t27, t26, t6, undef:i64
t29: ch = AArch64ISD::RET_GLUE t28
With `t25: v4i32 = add t31, t30`, we could do a dagcombine to get the DAG below; because we do not generate a custom node for `smlsl2` at the DAG level, the combine has to produce `uzp1` plus two `SMULL`s and two `sub`s and leave the rest to instruction selection. I think it is also not simple...
t0: ch,glue = EntryToken
t2: v8i16,ch = CopyFromReg t0, Register:v8i16 %0
t8: i64,ch = CopyFromReg t0, Register:i64 %3
t11: v8i16,ch = load<(load (s128) from %ir.p, align 4)> t0, t8, undef:i64
t13: i64 = add nuw t8, Constant:i64<16>
t14: v8i16,ch = load<(load (s128) from %ir.arrayidx2, align 4)> t0, t13, undef:i64
t34: v8i16 = AArch64ISD::UZP1 t11, t14
t28: ch = TokenFactor t11:1, t14:1
t4: v4i32,ch = CopyFromReg t0, Register:v4i32 %1
t17: v4i16 = extract_subvector t2, Constant:i64<0>
t19: v4i16 = extract_subvector t34, Constant:i64<0>
t32: v4i32 = AArch64ISD::SMULL t17, t19
t35: v4i32 = sub t4, t32
t23: v4i16 = extract_subvector t2, Constant:i64<4>
t24: v4i16 = extract_subvector t34, Constant:i64<4>
t31: v4i32 = AArch64ISD::SMULL t23, t24
t36: v4i32 = sub t35, t31
t6: i64,ch = CopyFromReg t0, Register:i64 %2
t29: ch = store<(store (s128) into %ir.out, !tbaa !6)> t28, t36, t6, undef:i64
t30: ch = AArch64ISD::RET_GLUE t29
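If we went with a DAGCombine, the match would look roughly like the sketch below. It is only meant to show the shape of the match on `sub(acc, add(SMULL(lo(a), trunc(ld0)), SMULL(hi(a), trunc(ld1))))`; the helper name is made up, it hard-codes the v4i16/v4i32 case, and it bitcasts the wide sources instead of re-typing the loads as in the dump above:

// Very rough sketch, not the patch: rebuild the pattern around an
// AArch64ISD::UZP1 of the wide sources so that instruction selection can
// later pick smlsl/smlsl2 for the two subs.
static SDValue tryCombineToUZP1SMLSL(SDNode *N, SelectionDAG &DAG) {
  if (N->getOpcode() != ISD::SUB)
    return SDValue();
  SDValue Acc = N->getOperand(0);
  SDValue Add = N->getOperand(1);
  if (Add.getOpcode() != ISD::ADD)
    return SDValue();

  SDValue MulLo = Add.getOperand(0), MulHi = Add.getOperand(1);
  if (MulLo.getOpcode() != AArch64ISD::SMULL ||
      MulHi.getOpcode() != AArch64ISD::SMULL)
    return SDValue();

  // The second operand of each SMULL must be a truncate of a wide vector.
  SDValue T0 = MulLo.getOperand(1), T1 = MulHi.getOperand(1);
  if (T0.getOpcode() != ISD::TRUNCATE || T1.getOpcode() != ISD::TRUNCATE)
    return SDValue();

  SDLoc DL(N);
  // uzp1 of the two v4i32 sources, reinterpreted as v8i16.
  SDValue Uzp = DAG.getNode(AArch64ISD::UZP1, DL, MVT::v8i16,
                            DAG.getBitcast(MVT::v8i16, T0.getOperand(0)),
                            DAG.getBitcast(MVT::v8i16, T1.getOperand(0)));
  SDValue Lo = DAG.getNode(ISD::EXTRACT_SUBVECTOR, DL, MVT::v4i16, Uzp,
                           DAG.getConstant(0, DL, MVT::i64));
  SDValue Hi = DAG.getNode(ISD::EXTRACT_SUBVECTOR, DL, MVT::v4i16, Uzp,
                           DAG.getConstant(4, DL, MVT::i64));
  SDValue M0 = DAG.getNode(AArch64ISD::SMULL, DL, MVT::v4i32,
                           MulLo.getOperand(0), Lo);
  SDValue M1 = DAG.getNode(AArch64ISD::SMULL, DL, MVT::v4i32,
                           MulHi.getOperand(0), Hi);
  // acc - m0 - m1, so each sub can later become an smlsl/smlsl2.
  SDValue S0 = DAG.getNode(ISD::SUB, DL, MVT::v4i32, Acc, M0);
  return DAG.getNode(ISD::SUB, DL, MVT::v4i32, S0, M1);
}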
With MIPeepholeOpt, it could be a bit simpler to support the other widening instructions that use the high half of the inputs... but I am not sure which one is better...
CHANGES SINCE LAST ACTION
https://reviews.llvm.org/D150969/new/
https://reviews.llvm.org/D150969