[PATCH] D118563: [AARCH64][NEON] Reuse extended vdup value in low version of long operations when doing tryCombineLongOpWithDup

Sun Jan 30 05:31:47 PST 2022

sunho created this revision.
Herald added subscribers: hiraditya, kristof.beyls.
sunho requested review of this revision.
Herald added a project: LLVM.
Herald added a subscriber: llvm-commits.

Fixes https://github.com/llvm/llvm-project/issues/53261

tryCombineLongOpWithDup is a function run during the SelectionDAG combine stage that extends a vdup vector to double its size so that the patterns for the high versions of long operations (such as umull2, sabal2) can be matched. The problem occurs when the original vdup vector is also used by the low version of a long operation. The high version is patched to use the extended vdup vector, but the low version is left intact. This generates redundant vdup instructions, because the low version does not reuse the extended vdup vector where it could. The following code excerpts from the GitHub issue demonstrate the redundant vdup instructions.

  uint8x16_t compare_ok(uint8x16_t x, uint8x16_t y, uint8x16_t X, uint8x16_t Y) {
      auto xx = vmull_u8(vget_low_u8(x), vget_low_u8(X));
      xx = vmlsl_u8(xx, vget_low_u8(y), vget_low_u8(Y));

      auto XX = vmull_u8(vget_high_u8(x), vget_high_u8(X));
      XX = vmlsl_u8(XX, vget_high_u8(y), vget_high_u8(Y));

      auto top_byte = vuzpq_u8(vreinterpretq_u8_u16(xx), vreinterpretq_u8_u16(XX)).val[1];
      return vshrq_n_u8(top_byte, 7);
  }
          // optimal usage of registers
          umull   v4.8h, v0.8b, v2.8b
          umull2  v0.8h, v0.16b, v2.16b
          umlsl   v4.8h, v1.8b, v3.8b
          umlsl2  v0.8h, v1.16b, v3.16b
          uzp2    v0.16b, v4.16b, v0.16b
          ushr    v0.16b, v0.16b, #7

  uint8x16_t compare_fail(uint8x16_t x, uint8x16_t y) {
      return compare_ok(x,y, vdupq_n_u8(33), vdupq_n_u8(119));
  }
          // same constant is allocated both to 8-byte and 16-byte registers
          movi    v2.8b, #33
          movi    v5.8b, #119
          movi    v3.16b, #33
          movi    v4.16b, #119
          umull   v2.8h, v0.8b, v2.8b
          umull2  v0.8h, v0.16b, v3.16b
          umlsl   v2.8h, v1.8b, v5.8b
          umlsl2  v0.8h, v1.16b, v4.16b
          uzp2    v0.16b, v2.16b, v0.16b
          ushr    v0.16b, v0.16b, #7

  uint8x16_t compare_fail(uint8x16_t x, uint8x16_t y, uint8x8_t coeffs) {
      return compare_ok(x,y, vdupq_lane_u8(coeffs, 0), vdupq_lane_u8(coeffs, 1));
  }
          // same constant is allocated both to 8-byte and 16-byte registers
          dup     v3.8b, v2.b[0]
          dup     v4.16b, v2.b[0]
          dup     v5.8b, v2.b[1]
          dup     v2.16b, v2.b[1]
          umull   v3.8h, v0.8b, v3.8b
          umull2  v0.8h, v0.16b, v4.16b
          umlsl   v3.8h, v1.8b, v5.8b
          umlsl2  v0.8h, v1.16b, v2.16b
          uzp2    v0.16b, v3.16b, v0.16b
          ushr    v0.16b, v0.16b, #7

This patch fixes the issue by iterating over the users of the original vdup vector and patching the low versions of long operations to use extract_low(new extended vdup vector).
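
The rewrite described above can be illustrated with a small standalone model. This is only a sketch and not the code in D118563: `Node`, `Graph`, `isLowLongOp` and `reuseExtendedDup` are hypothetical names invented here, the opcode strings are placeholders, and none of the real SelectionDAG API is used. It shows just the idea of walking the users of the narrow dup and redirecting low long operations to an extract_low of the wide dup.

```cpp
#include <cassert>
#include <memory>
#include <string>
#include <utility>
#include <vector>

// Toy stand-in for a DAG node. Hypothetical; NOT llvm::SDNode.
struct Node {
    std::string op;               // placeholder opcode, e.g. "dup8", "umull"
    std::vector<Node*> operands;  // non-owning pointers into the Graph
};

// Toy stand-in for the DAG: owns every node and can list the users of a node.
struct Graph {
    std::vector<std::unique_ptr<Node>> nodes;

    Node* make(std::string op, std::vector<Node*> operands = {}) {
        nodes.push_back(
            std::make_unique<Node>(Node{std::move(op), std::move(operands)}));
        return nodes.back().get();
    }

    // Every node that has `n` as an operand, each listed once.
    std::vector<Node*> usersOf(const Node* n) const {
        std::vector<Node*> result;
        for (const auto& candidate : nodes) {
            for (const Node* operand : candidate->operands) {
                if (operand == n) {
                    result.push_back(candidate.get());
                    break;
                }
            }
        }
        return result;
    }
};

// Assumption for this sketch: only these two opcodes count as "low" long ops.
bool isLowLongOp(const Node& n) { return n.op == "umull" || n.op == "umlsl"; }

// Redirect every low long-op user of `narrowDup` to extract_low(`wideDup`).
// Returns the number of operands that were rewritten.
int reuseExtendedDup(Graph& g, Node* narrowDup, Node* wideDup) {
    assert(narrowDup && wideDup);
    // Snapshot the users first, because the loop below adds a node.
    const std::vector<Node*> users = g.usersOf(narrowDup);
    Node* low = nullptr;  // created lazily, shared by all rewritten users
    int rewritten = 0;
    for (Node* user : users) {
        if (!isLowLongOp(*user)) continue;
        if (!low) low = g.make("extract_low", {wideDup});
        for (Node*& operand : user->operands) {
            if (operand == narrowDup) {
                operand = low;
                ++rewritten;
            }
        }
    }
    return rewritten;
}
```

In this model, once the low operation has been redirected the narrow dup has no users left, which corresponds to the redundant 8-byte movi/dup disappearing from the generated code.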

Repository:
  rG LLVM Github Monorepo

https://reviews.llvm.org/D118563

Files:
  llvm/lib/Target/AArch64/AArch64ISelLowering.cpp
  llvm/test/CodeGen/AArch64/aarch64-combine-long-op-dup-noduplicate.ll

-------------- next part --------------
A non-text attachment was scrubbed...
Name: D118563.404367.patch
Type: text/x-patch
Size: 10377 bytes
Desc: not available
URL: <http://lists.llvm.org/pipermail/llvm-commits/attachments/20220130/2319dd10/attachment.bin>