[PATCH] D150969: [AArch64] Try to convert two XTN and two SMLSL to UZP1, SMLSL and SMLSL2

JinGu Kang via Phabricator via llvm-commits llvm-commits at lists.llvm.org
Mon May 22 04:19:25 PDT 2023


jaykang10 added a comment.

@efriedma Thanks for your kind comment.

> foo() is your original testcase; foo2() is modified to use intrinsics that more closely match the expected sequence, foo3 is modified to get rid of the redundant vcombine/vget pair.  clang and gcc generate essentially the same code for foo2() and foo3(); somehow the way foo() is written tickles some combine in gcc that makes it treat it like foo2 instead of foo3.

Yep, I agree with you.
I have already told the team it would be better to use the `vuzp1q_s16` intrinsic directly in the `foo` example rather than expecting the compiler to do the optimization... but the team wants llvm to support the example like gcc does, in addition to using `vuzp1q_s16`...
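
For reference, this is roughly what I suggested to the team. The function name and signature are just placeholders mirroring the `foo4` shape quoted below (the original `foo` body is not quoted in this message), and it assumes little-endian lane order so that `vuzp1q_s16` on the reinterpreted loads picks the low 16 bits of each 32-bit lane, like `xtn`/`xtn2`:

  #include <arm_neon.h>

  void foo_uzp1(int16x8_t a, int32x4_t acc, int32x4_t *out, const int32_t *p) {
      // Build b with an explicit uzp1 instead of two narrows plus a combine.
      int16x8_t b = vuzp1q_s16(vreinterpretq_s16_s32(vld1q_s32(&p[0])),
                               vreinterpretq_s16_s32(vld1q_s32(&p[4])));
      acc = vmlsl_s16(acc, vget_low_s16(a), vget_low_s16(b));   // smlsl
      acc = vmlsl_high_s16(acc, a, b);                          // smlsl2
      *out = acc;
  }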

> It looks like your patch fixes the code for both foo2 and foo3; is that right?

The patch was to fix `foo`, but it looks like `foo3` is also affected by this patch because it generates the same MIR sequence `xtn + xtn + smlsl + smlsl`.

> Can we generalize this to optimize the following?  Maybe split the transform into two steps: one to optimize the following, then one to optimize any remaining extra instructions?
>
>   void foo4(int16x8_t a, int32x4_t acc, int32x4_t *out, const int32_t *p) {
>       int16x8_t b = vcombine_s16(vmovn_s32(vld1q_s32(&p[0])),
>                                  vmovn_s32(vld1q_s32(&p[4])));
>       acc = vmlsl_high_s16(acc, a, b);
>       *out = acc;
>   }

The LLVM IR before and after inlining is as below.

  Before inlining
  define dso_local void @foo4(<8 x i16> noundef %0, <4 x i32> noundef %1, ptr noundef %2, ptr noundef %3) #0 {
    %5 = load <4 x i32>, ptr %3, align 4
    %6 = call <4 x i16> @vmovn_s32(<4 x i32> noundef %5)
    %7 = getelementptr inbounds i32, ptr %3, i64 4 
    %8 = load <4 x i32>, ptr %7, align 4
    %9 = call <4 x i16> @vmovn_s32(<4 x i32> noundef %8)
    %10 = call <8 x i16> @vcombine_s16(<4 x i16> noundef %6, <4 x i16> noundef %9)
    %11 = call <4 x i32> @vmlsl_high_s16(<4 x i32> noundef %1, <8 x i16> noundef %0, <8 x i16> noundef %10)
    store <4 x i32> %11, ptr %2, align 16, !tbaa !6
    ret void
  }
  
  After inlining
  define dso_local void @foo4(<8 x i16> noundef %0, <4 x i32> noundef %1, ptr noundef %2, ptr noundef %3) local_unnamed_addr #0 {
    %5 = load <4 x i32>, ptr %3, align 4
    %6 = trunc <4 x i32> %5 to <4 x i16>
    %7 = getelementptr inbounds i32, ptr %3, i64 4 
    %8 = load <4 x i32>, ptr %7, align 4
    %9 = trunc <4 x i32> %8 to <4 x i16>
    %10 = shufflevector <4 x i16> %6, <4 x i16> %9, <8 x i32> <i32 0, i32 1, i32 2, i32 3, i32 4, i32 5, i32 6, i32 7>
    %11 = shufflevector <8 x i16> %0, <8 x i16> poison, <4 x i32> <i32 4, i32 5, i32 6, i32 7>
    %12 = call <4 x i32> @llvm.aarch64.neon.smull.v4i32(<4 x i16> %11, <4 x i16> %9)
    %13 = sub <4 x i32> %1, %12
    store <4 x i32> %13, ptr %2, align 16, !tbaa !6
    ret void
  }

As you can see, after inlining the `%10 = shufflevector` is dead, so it is eventually removed and we end up with the IR below.

  define dso_local void @foo4(<8 x i16> noundef %0, <4 x i32> noundef %1, ptr nocapture noundef writeonly %2, ptr nocapture noundef readonly %3) local_unnamed_addr #0 {
    %5 = getelementptr inbounds i32, ptr %3, i64 4
    %6 = load <4 x i32>, ptr %5, align 4
    %7 = trunc <4 x i32> %6 to <4 x i16>
    %8 = shufflevector <8 x i16> %0, <8 x i16> poison, <4 x i32> <i32 4, i32 5, i32 6, i32 7>
    %9 = tail call <4 x i32> @llvm.aarch64.neon.smull.v4i32(<4 x i16> %8, <4 x i16> %7)
    %10 = sub <4 x i32> %1, %9
    store <4 x i32> %10, ptr %2, align 16, !tbaa !6
    ret void
  }

In my opinion, it is hard to generate `uzp1` from the above LLVM IR; once the redundant `shufflevector` is gone, only a single `trunc` of one load remains, so there is no pair of narrowing operations left to merge into a `uzp1`. The legalized DAG is as below.

  Legalized selection DAG: %bb.0 'foo4:entry'
  SelectionDAG has 20 nodes:
    t0: ch,glue = EntryToken
        t8: i64,ch = CopyFromReg t0, Register:i64 %3
      t10: i64 = add nuw t8, Constant:i64<16>
    t13: v4i32,ch = load<(load (s128) from %ir.arrayidx2, align 4)> t0, t10, undef:i64
          t4: v4i32,ch = CopyFromReg t0, Register:v4i32 %1
              t2: v8i16,ch = CopyFromReg t0, Register:v8i16 %0
            t17: v4i16 = extract_subvector t2, Constant:i64<4>
            t14: v4i16 = truncate t13
          t24: v4i32 = AArch64ISD::SMULL t17, t14
        t21: v4i32 = sub t4, t24
        t6: i64,ch = CopyFromReg t0, Register:i64 %2
      t22: ch = store<(store (s128) into %ir.out, !tbaa !6)> t13:1, t21, t6, undef:i64
    t23: ch = AArch64ISD::RET_GLUE t22

> Can we generalize this to handle other widening instructions that use the high half of the inputs?

I think so.
The main issue is generating the `uzp1`. The `smlsl` acts as the anchor node for detecting the code sequence that can use `uzp1`, so I think we could cover similar cases as well (a sketch of one such case follows).
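
A hypothetical variant (not from this review): the same narrowing shape, but accumulating with `smlal`/`smlal2`. If the detection is keyed only on the `smlsl` opcode, this case would be missed even though the same `uzp1` rewrite applies:

  #include <arm_neon.h>

  // Same xtn + xtn + combine shape as foo4, but accumulating with
  // smlal/smlal2 instead of smlsl/smlsl2.
  void foo_smlal(int16x8_t a, int32x4_t acc, int32x4_t *out, const int32_t *p) {
      int16x8_t b = vcombine_s16(vmovn_s32(vld1q_s32(&p[0])),
                                 vmovn_s32(vld1q_s32(&p[4])));
      acc = vmlal_s16(acc, vget_low_s16(a), vget_low_s16(b));   // smlal
      acc = vmlal_high_s16(acc, a, b);                          // smlal2
      *out = acc;
  }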

> Any thoughts on a DAGCombine vs. MIPeepholeOpt?

The legalized DAG for `foo` is as below.

  Legalized selection DAG: %bb.0 'foo:entry'
  SelectionDAG has 27 nodes:
    t0: ch,glue = EntryToken
    t2: v8i16,ch = CopyFromReg t0, Register:v8i16 %0
    t8: i64,ch = CopyFromReg t0, Register:i64 %3
    t11: v4i32,ch = load<(load (s128) from %ir.p, align 4)> t0, t8, undef:i64
      t14: i64 = add nuw t8, Constant:i64<16>
    t15: v4i32,ch = load<(load (s128) from %ir.arrayidx2, align 4)> t0, t14, undef:i64
        t27: ch = TokenFactor t11:1, t15:1
          t4: v4i32,ch = CopyFromReg t0, Register:v4i32 %1
              t18: v4i16 = extract_subvector t2, Constant:i64<0>
              t12: v4i16 = truncate t11
            t31: v4i32 = AArch64ISD::SMULL t18, t12
              t23: v4i16 = extract_subvector t2, Constant:i64<4>
              t16: v4i16 = truncate t15
            t30: v4i32 = AArch64ISD::SMULL t23, t16
          t25: v4i32 = add t31, t30
        t26: v4i32 = sub t4, t25
        t6: i64,ch = CopyFromReg t0, Register:i64 %2
      t28: ch = store<(store (s128) into %ir.out, !tbaa !6)> t27, t26, t6, undef:i64
    t29: ch = AArch64ISD::RET_GLUE t28

Starting from `t25: v4i32 = add t31, t30`, we could do a DAGCombine to produce the DAG below, because we do not generate a custom node for `smlsl2` at the DAG level. I think it is also not simple... (a rough sketch of such a combine follows the DAG dump)

  t0: ch,glue = EntryToken
  t2: v8i16,ch = CopyFromReg t0, Register:v8i16 %0
  t8: i64,ch = CopyFromReg t0, Register:i64 %3
  t11: v8i16,ch = load<(load (s128) from %ir.p, align 4)> t0, t8, undef:i64
    t13: i64 = add nuw t8, Constant:i64<16>
  t14: v8i16,ch = load<(load (s128) from %ir.arrayidx2, align 4)> t0, t13, undef:i64
  t34: v8i16 = AArch64ISD::UZP1 t11, t14
      t28: ch = TokenFactor t11:1, t14:1
          t4: v4i32,ch = CopyFromReg t0, Register:v4i32 %1
            t17: v4i16 = extract_subvector t2, Constant:i64<0>
            t19: v4i16 = extract_subvector t34, Constant:i64<0>
          t32: v4i32 = AArch64ISD::SMULL t17, t19
        t35: v4i32 = sub t4, t32
          t23: v4i16 = extract_subvector t2, Constant:i64<4>
          t24: v4i16 = extract_subvector t34, Constant:i64<4>
        t31: v4i32 = AArch64ISD::SMULL t23, t24
      t36: v4i32 = sub t35, t31
      t6: i64,ch = CopyFromReg t0, Register:i64 %2
    t29: ch = store<(store (s128) into %ir.out, !tbaa !6)> t28, t36, t6, undef:i64
  t30: ch = AArch64ISD::RET_GLUE t29
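
To make the DAGCombine option concrete, below is a rough sketch of what such a combine could look like. It is not the actual patch: the function name, the restriction to the v4i32/v4i16 case, and the omission of operand-order and use-count checks are all simplifications, and it keeps the `add` rather than splitting the outer `sub` as the DAG above does.

  // Sketch: match add(SMULL(lo(a), trunc(x)), SMULL(hi(a), trunc(y))) and
  // rebuild both truncated operands from a single UZP1 of the wide sources.
  // (Assumes the usual AArch64ISelLowering.cpp context and includes.)
  static SDValue tryCombineTruncSmullToUzp1(SDNode *N, SelectionDAG &DAG) {
    if (N->getOpcode() != ISD::ADD || N->getValueType(0) != MVT::v4i32)
      return SDValue();

    SDValue L = N->getOperand(0), R = N->getOperand(1);
    if (L.getOpcode() != AArch64ISD::SMULL || R.getOpcode() != AArch64ISD::SMULL)
      return SDValue();

    SDValue TruncLo = L.getOperand(1), TruncHi = R.getOperand(1);
    if (TruncLo.getOpcode() != ISD::TRUNCATE ||
        TruncHi.getOpcode() != ISD::TRUNCATE ||
        TruncLo.getOperand(0).getValueType() != MVT::v4i32 ||
        TruncHi.getOperand(0).getValueType() != MVT::v4i32)
      return SDValue();

    SDLoc DL(N);
    // uzp1 of the untruncated v4i32 sources, reinterpreted as v8i16, keeps
    // the low 16 bits of every 32-bit lane, i.e. the same data as xtn + xtn2.
    SDValue Src0 = DAG.getNode(ISD::BITCAST, DL, MVT::v8i16, TruncLo.getOperand(0));
    SDValue Src1 = DAG.getNode(ISD::BITCAST, DL, MVT::v8i16, TruncHi.getOperand(0));
    SDValue Uzp1 = DAG.getNode(AArch64ISD::UZP1, DL, MVT::v8i16, Src0, Src1);

    SDValue NewLo = DAG.getNode(ISD::EXTRACT_SUBVECTOR, DL, MVT::v4i16, Uzp1,
                                DAG.getConstant(0, DL, MVT::i64));
    SDValue NewHi = DAG.getNode(ISD::EXTRACT_SUBVECTOR, DL, MVT::v4i16, Uzp1,
                                DAG.getConstant(4, DL, MVT::i64));

    SDValue MulLo = DAG.getNode(AArch64ISD::SMULL, DL, MVT::v4i32,
                                L.getOperand(0), NewLo);
    SDValue MulHi = DAG.getNode(AArch64ISD::SMULL, DL, MVT::v4i32,
                                R.getOperand(0), NewHi);
    return DAG.getNode(ISD::ADD, DL, MVT::v4i32, MulLo, MulHi);
  }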

With MIPeepholeOpt, it could be a bit simpler to add support for the other widening instructions that use the high half of the inputs... but I am not sure which approach is better...


CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D150969/new/

https://reviews.llvm.org/D150969


