[PATCH] D75667: [ARM][MVE] Enable SHRN for tail predication

Fri Mar 6 02:46:24 PST 2020

dmgreen added a comment.

OK. If I am understanding correctly, I'm not sure that is enough. What if we load a v8i16, widen it into two v4i32s using something like a VMULL, and then narrow those back into a single v8i16? I don't think this is something autovec will produce (yet), but it could come up from intrinsics in a way that people are likely to write. Something like this:

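  #include <arm_mve.h>

  void test(short *x, short *y, short *z, int n) {
    while (n > 0) {
      mve_pred16_t pred = vctp16q(n);        // active 16-bit lanes for this iteration
      int16x8_t a = vldrhq_z_s16(x, pred);
      int16x8_t b = vldrhq_z_s16(y, pred);
      int32x4_t top = vmulltq_int(a, b);     // 32-bit products of the odd (top) lanes
      int32x4_t bot = vmullbq_int(a, b);     // 32-bit products of the even (bottom) lanes
      // Saturating (product >> 16), narrowed into the even lanes, then the odd lanes.
      int16x8_t rbot = vqshrnbq(vuninitializedq_s16(), bot, 16);
      int16x8_t res = vqshrntq(rbot, top, 16);
      vstrhq_p_s16(z, res, pred);

      x += 8;
      y += 8;
      z += 8;
      n -= 8;
    }
  }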

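I'm pretty sure that tail predicating this would not be valid, as the top bits of one of the muls could be cut off: the tail predicate counts 16-bit elements, so when an odd number of elements remain it covers only the bottom half of one 32-bit VMULL result, and the narrowing shift by 16 then reads the unwritten top half of that product.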
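A small host-side C model can make the concern concrete. It is only a sketch, not MVE code and not how the ARM backend reasons about this. It assumes that a 16-bit tail predicate masks the destination at byte granularity (only the first 2*n bytes are written) and that masked-off bytes keep their previous contents, modelled here as zero. The names `qreg`, `model_vmullb`, `lane32`, `qshrn16` and `narrowed_lane2` are hypothetical helpers for illustration.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical byte-level model of a 128-bit MVE Q register (little-endian lanes). */
typedef struct {
  uint8_t b[16];
} qreg;

/* Models VMULLB.S16: 32-bit products of the even (bottom) 16-bit lanes.
   tail_n < 0 means "no tail predication". Otherwise only the first 2 * tail_n
   bytes are written (assumed byte-granular 16-bit tail predicate); the other
   bytes keep whatever the destination held before. */
static void model_vmullb(qreg *dst, const int16_t a[8], const int16_t b[8], int tail_n) {
  for (int lane = 0; lane < 4; lane++) {
    int32_t prod = (int32_t)a[2 * lane] * (int32_t)b[2 * lane];
    uint32_t bits = (uint32_t)prod;
    for (int k = 0; k < 4; k++) {
      int byte = 4 * lane + k;
      if (tail_n < 0 || byte < 2 * tail_n)
        dst->b[byte] = (uint8_t)(bits >> (8 * k));
    }
  }
}

/* Reads 32-bit lane `lane` back out of the modelled register. */
static int32_t lane32(const qreg *q, int lane) {
  uint32_t bits = 0;
  for (int k = 0; k < 4; k++)
    bits |= (uint32_t)q->b[4 * lane + k] << (8 * k);
  int32_t v;
  memcpy(&v, &bits, sizeof v); /* reinterpret as two's complement */
  return v;
}

/* One lane of VQSHRNB.S32 #16: arithmetic shift right by 16, saturate to int16. */
static int16_t qshrn16(int32_t v) {
  int32_t r = v >> 16; /* arithmetic shift on mainstream compilers */
  if (r > INT16_MAX) r = INT16_MAX;
  if (r < INT16_MIN) r = INT16_MIN;
  return (int16_t)r;
}

/* Output lane 2 of the narrowed vector comes from 32-bit product lane 1,
   i.e. a[2] * b[2]. Returns that output lane for inputs x, y when tail_n
   16-bit elements remain (tail_n < 0: unpredicated). */
int16_t narrowed_lane2(int16_t x, int16_t y, int tail_n) {
  int16_t a[8] = {0}, b[8] = {0};
  a[2] = x;
  b[2] = y;
  qreg dst;
  memset(&dst, 0, sizeof dst); /* stale destination contents, modelled as zero */
  model_vmullb(&dst, a, b, tail_n);
  return qshrn16(lane32(&dst, 1));
}
```

With `x = y = 20000` the product is 400000000 (0x17D78400), so the unpredicated loop stores 0x17D7 = 6103 in output lane 2. With 3 elements remaining, lane 2 is still active and gets stored, but in this model only the low half 0x8400 of its product was written, so `narrowed_lane2(20000, 20000, 3)` returns 0. With 4 elements remaining the whole product lane is written and the result is 6103 again.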

Repository:
  rG LLVM Github Monorepo

https://reviews.llvm.org/D75667