[PATCH] D150969: [AArch64] Try to convert two XTN and two SMLSL to UZP1, SMLSL and SMLSL2

Eli Friedman via Phabricator via llvm-commits llvm-commits at lists.llvm.org
Fri May 19 14:21:16 PDT 2023


efriedma added a comment.

Consider the following:

  #include <arm_neon.h>
  
  void foo(int16x8_t a, int32x4_t acc, int32x4_t *out, const int32_t *p) {
      int16x8_t b = vcombine_s16(vmovn_s32(vld1q_s32(&p[0])),
                                 vmovn_s32(vld1q_s32(&p[4])));
      acc = vmlsl_s16(acc, vget_low_s16(a), vget_low_s16(b));
      acc = vmlsl_high_s16(acc, a, b);
      *out = acc;
  }
  
  void foo2(int16x8_t a, int32x4_t acc, int32x4_t *out, const int32_t *p) {
      int16x8_t b = vuzp1q_s16(vreinterpretq_s16_s32(vld1q_s32(&p[0])),
                               vreinterpretq_s16_s32(vld1q_s32(&p[4])));
      acc = vmlsl_s16(acc, vget_low_s16(a), vget_low_s16(b));
      acc = vmlsl_high_s16(acc, a, b);
      *out = acc;
  }
  
  void foo3(int16x8_t a, int32x4_t acc, int32x4_t *out, const int32_t *p) {
      acc = vmlsl_s16(acc, vget_low_s16(a), vmovn_s32(vld1q_s32(&p[0])));
      acc = vmlsl_s16(acc, vget_high_s16(a), vmovn_s32(vld1q_s32(&p[4])));
      *out = acc;
  }

foo() is your original testcase; foo2() is modified to use intrinsics that more closely match the expected instruction sequence; foo3() is modified to get rid of the redundant vcombine/vget pair.  clang and gcc generate essentially the same code for foo2() and foo3(); somehow the way foo() is written tickles some combine in gcc that makes gcc treat it like foo2() instead of foo3().

It looks like your patch fixes the code for both foo2 and foo3; is that right?

Can we generalize this to optimize the following foo4()?  Maybe split the transform into two steps: one that handles the foo4() pattern, then one that cleans up any remaining extra instructions?

  void foo4(int16x8_t a, int32x4_t acc, int32x4_t *out, const int32_t *p) {
      int16x8_t b = vcombine_s16(vmovn_s32(vld1q_s32(&p[0])),
                                 vmovn_s32(vld1q_s32(&p[4])));
      acc = vmlsl_high_s16(acc, a, b);
      *out = acc;
  }

Can we generalize this to handle other widening instructions that use the high half of the inputs?

Any thoughts on a DAGCombine vs. MIPeepholeOpt?


CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D150969/new/

https://reviews.llvm.org/D150969
