[PATCH] D138874: [InstCombine] canonicalize trunc + insert as bitcast + shuffle, part 3

Thu Dec 1 05:58:31 PST 2022

spatel added a comment.

In D138874#3962771 <https://reviews.llvm.org/D138874#3962771>, @dmgreen wrote:

>> We can create a select-shuffle for all targets because targets are expected to be able to lower select-shuffles reasonably.
>
> A perhaps minor point, and I don't think I have objections to the patch, but I always considered select-shuffles to be somewhat of an x86-ism. I believe there is a special set of instructions for handling them, where the mask is stored as part of the instruction. As far as I understand there usually isn't a truly generic way to lower them efficiently (I'd be interested if there was!), and in the worst case you need to resort to either lane moves or a constant mask + and/or. If it's only a single lane, as in all the tests, then it would just be an extract+insert, which is simpler.
>
> Generally I would consider shuffles to be complex operations that often have a fairly high cost. Insert and trunc and bitcast are all usually simple.

That's a good point. x86 does have limited specialized select-shuffles (blends in x86 lingo) depending on which level of SSE/AVX is implemented. Most other SIMD targets have a vector bitwise select (`bsl` on AArch64 IIRC).
But yes, in the cases here "select-shuffle" is actually an over-specification/misnomer because we're only inserting a single element (what started as the scalar value) into the base vector.
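
For illustration only, here is a rough scalar model of the "constant mask + and/or" lowering mentioned above, which is also what a vector bitwise select computes. It treats a 64-bit integer as four 16-bit lanes with little-endian lane numbering; `bitwiseSelect` and `laneMask16` are hypothetical names made up for this sketch, not LLVM or target APIs.

```cpp
#include <cstdint>

// Hypothetical helper: take bits from 'b' where 'mask' is 1, else from 'a'.
// This is the generic and/or form of a select-shuffle whose lanes do not move.
constexpr uint64_t bitwiseSelect(uint64_t a, uint64_t b, uint64_t mask) {
  return (a & ~mask) | (b & mask);
}

// Hypothetical helper: all-ones in 16-bit lane 'lane', zeros elsewhere
// (assumes little-endian lane numbering, so lane 3 is bits 48..63).
constexpr uint64_t laneMask16(unsigned lane) {
  return uint64_t{0xFFFF} << (16 * lane);
}

// Replacing only lane 3 of the base value with lane 3 of the source.
static_assert(bitwiseSelect(0x1111222233334444ULL, 0xAAAABBBBCCCCDDDDULL,
                            laneMask16(3)) == 0xAAAA222233334444ULL,
              "only the top 16-bit lane comes from the source");
```

With a single-lane mask like this, the select degenerates to one element insert, which is why "select-shuffle" overstates what happens in these tests.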

I tried pushing a couple of tests through AArch64 codegen, and see diffs like this:

  lsr	x8, x0, #48
  mov	v0.h[3], w8
  ->
  fmov	d1, x0
  mov	v0.h[3], v1.h[3]

Does that seem neutral? If not, we could try harder to fold back to an insertelt in codegen or convert to a target-dependent transform in VectorCombine instead of a generic fold here.
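
To make the two forms behind that asm diff concrete, here is a small scalar model. It is a sketch under assumptions: a `<4 x i16>` base vector, an `i64` scalar, little-endian lane order, and a shuffle mask of `<0, 1, 2, 7>`, which is my reading of the asm rather than the exact IR the patch matches. All function names are hypothetical.

```cpp
#include <array>
#include <cstdint>

using V4i16 = std::array<uint16_t, 4>;

// Models "bitcast i64 %x to <4 x i16>" with little-endian lane order
// (lane 0 holds the low 16 bits). Written with shifts so the result does
// not depend on the host's byte order.
inline V4i16 bitcastLE(uint64_t x) {
  return {static_cast<uint16_t>(x), static_cast<uint16_t>(x >> 16),
          static_cast<uint16_t>(x >> 32), static_cast<uint16_t>(x >> 48)};
}

// Models a two-input shufflevector: mask values 0..3 pick from 'a',
// values 4..7 pick from 'b'.
inline V4i16 shuffle(const V4i16 &a, const V4i16 &b,
                     const std::array<int, 4> &mask) {
  V4i16 r{};
  for (int i = 0; i < 4; ++i)
    r[i] = mask[i] < 4 ? a[mask[i]] : b[mask[i] - 4];
  return r;
}

// "Before" form: shift, truncate, insert into lane 3.
inline V4i16 insertTruncShift(V4i16 v, uint64_t x) {
  v[3] = static_cast<uint16_t>(x >> 48);
  return v;
}

// "After" form: bitcast the scalar, then select lane 3 from it.
inline V4i16 bitcastShuffle(const V4i16 &v, uint64_t x) {
  return shuffle(v, bitcastLE(x), {{0, 1, 2, 7}});
}
```

Both functions return the same vector for any input under these assumptions, so the remaining question is purely whether the backend lowers the shuffle form as cheaply as the insert form.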

https://reviews.llvm.org/D138874