[PATCH] D53037: [InstCombine] combine a shuffle and an extract subvector shuffle

Tue Feb 5 08:37:28 PST 2019

fhahn added a comment.
Herald added a project: LLVM.

I found a case where this combine causes a codegen regression on AArch64. In the example below, `%s0` puts data into a 128-bit vector, and `%s1` and `%s2` extract its lower and upper halves. Without folding `%s0` and `%s1`, we can generate a single AArch64 tbl instruction for `%s0` and a mov instruction for `%s2`. With the fold in this patch, we generate 3 additional instructions: an additional tbl for `%s2` and 2 instructions for loading the mask.

So on AArch64, the combine produces worse code when we can generate a single tbl instruction for the top-level shuffle and then extract the lower and upper halves, which is cheap. Do you have an idea how best to address the issue?

  define <8 x i16> @test(<16 x i8> %s) {
  entry:
    %0 = sub <16 x i8> <i8 undef, i8 undef, i8 undef, i8 -1, i8 undef, i8 undef, i8 undef, i8 -1, i8 undef, i8 undef, i8 undef, i8 -1, i8 undef, i8 undef, i8 undef, i8 -1>, %s
    %s0 = shufflevector <16 x i8> %0, <16 x i8> undef, <16 x i32> <i32 3, i32 3, i32 3, i32 3, i32 7, i32 7, i32 7, i32 7, i32 11, i32 11, i32 11, i32 11, i32 15, i32 15, i32 15, i32 15>
    %s1 = shufflevector <16 x i8> %s0, <16 x i8> undef, <8 x i32> <i32 0, i32 1, i32 2, i32 3, i32 4, i32 5, i32 6, i32 7>
    %s2 = shufflevector <16 x i8> %s0, <16 x i8> undef, <8 x i32> <i32 8, i32 9, i32 10, i32 11, i32 12, i32 13, i32 14, i32 15>
    %a = call <8 x i16> @fn(<8 x i8> %s1, <8 x i8> %s2)
    ret <8 x i16> %a
  }

  declare <8 x i16> @fn(<8 x i8>, <8 x i8>)
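
For reference, the mask arithmetic behind folding an extract-subvector shuffle into the shuffle that feeds it can be sketched in a few lines of Python. This is only an illustration of the index composition, not the InstCombine code from the patch: `compose_masks` and `UNDEF` are hypothetical names, and the sketch assumes both shuffles take undef as their second operand, as in the IR above.

```python
UNDEF = -1  # stand-in for an undef mask element (an assumption of this sketch)


def compose_masks(inner, outer):
    """Return the mask that reads straight from the inner shuffle's source.

    Assumes both shuffles take undef as their second operand, so any outer
    index past the end of `inner` selects undef. Entries of `inner` are
    passed through unchanged.
    """
    folded = []
    for idx in outer:
        if idx == UNDEF or idx >= len(inner):
            folded.append(UNDEF)
        else:
            folded.append(inner[idx])
    return folded


# Masks copied from %s0, %s1 and %s2 in the IR above.
s0_mask = [3, 3, 3, 3, 7, 7, 7, 7, 11, 11, 11, 11, 15, 15, 15, 15]
s1_mask = list(range(0, 8))   # extract the lower half
s2_mask = list(range(8, 16))  # extract the upper half

low = compose_masks(s0_mask, s1_mask)
high = compose_masks(s0_mask, s2_mask)
print(low)   # [3, 3, 3, 3, 7, 7, 7, 7]
print(high)  # [11, 11, 11, 11, 15, 15, 15, 15]
```

A folded extract is no longer a plain half extraction of `%s0` but a shuffle with its own non-trivial mask. That may be why the backend ends up needing another tbl plus the mask load, although the sketch itself says nothing about instruction selection.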

Repository:
  rL LLVM

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D53037/new/

https://reviews.llvm.org/D53037