[PATCH] D70223: [DAGCombine] Split vector load-update-store into single element stores

Sanjay Patel via Phabricator via llvm-commits llvm-commits at lists.llvm.org
Fri Nov 22 08:18:15 PST 2019


spatel added a comment.

In D70223#1756723 <https://reviews.llvm.org/D70223#1756723>, @nemanjai wrote:

> @spatel I know you were initially against doing this in `InstCombine`, but I still believe that is a better place for this and a simpler way to implement it. If we narrow the scope of this to only handle insertions at constant indices, `lib/CodeGen/ScalarizeMaskedMemIntrin.cpp` should handle this quite well. And on the subject of cost model, I don't really think we need a target-specific cost model for this - simply the count of `load/store/insertelement` operations we are saving with the masked intrinsic weighed against the likely number of stores if we expand the masked intrinsic.
>  For the attached example, we are getting rid of a `load` and two `insertelement` instructions and introducing a mask that will expand to a maximum of 2 stores, so it probably makes sense to do it. On the other hand, if we get rid of a `load` and three `insertelement` instructions and introduce a mask that may expand to 3 stores, it is probably not worth it.


I'm still skeptical that IR canonicalization is better/easier, but I could be convinced. You're still going to have to do the memory analysis that @efriedma mentioned to make sure this is even a valid transform. Is that easier in IR?
To proceed on the IR path (and again, I'm skeptical that instcombine vs. some other IR pass is the right option), we would need to create codegen tests for multiple targets with the alternative IR sequences. Then, we would potentially need to improve that output for multiple targets. After that is done, we could transform to the masked store intrinsic in IR.

Here's an example of a codegen test of the IR alternatives. As I think was originally shown in the llvm-dev thread, we want to replace 2 out of 4 elements of a vector, but here the inserted values are scalar params/registers rather than constants:

  define void @insert_store(<4 x i32>* %q, i32 %s0, i32 %s3) {
    %t0 = load <4 x i32>, <4 x i32>* %q, align 16
    %vecins0 = insertelement <4 x i32> %t0, i32 %s0, i32 0
    %vecins3 = insertelement <4 x i32> %vecins0, i32 %s3, i32 3
    store <4 x i32> %vecins3, <4 x i32>* %q, align 16
    ret void
  }
  
  declare void @llvm.masked.store.v4i32.p0v4i32(<4 x i32>, <4 x i32>*, i32, <4 x i1>)
  
  define void @masked_store(<4 x i32>* %q, i32 %s0, i32 %s3) {
    %vecins0 = insertelement <4 x i32> undef, i32 %s0, i32 0
    %vecins3 = insertelement <4 x i32> %vecins0, i32 %s3, i32 3
    call void @llvm.masked.store.v4i32.p0v4i32(<4 x i32> %vecins3, <4 x i32>* %q, i32 16, <4 x i1> <i1 1, i1 0, i1 0, i1 1>)
    ret void
  }

The 2nd sequence looks way better for PPC, so great:

  addis 6, 2, .LCPI0_0@toc@ha
  mtvsrd 0, 4
  lvx 4, 0, 3
  addi 4, 6, .LCPI0_0@toc@l
  xxswapd 34, 0
  lvx 3, 0, 4
  mtvsrd 0, 5
  addis 4, 2, .LCPI0_1@toc@ha
  addi 4, 4, .LCPI0_1@toc@l
  vperm 2, 2, 4, 3
  xxswapd 35, 0
  lvx 4, 0, 4
  vperm 2, 3, 2, 4
  stvx 2, 0, 3

vs.

  stw 4, 0(3)
  stw 5, 12(3)

But here's what happens on x86 with AVX2 (this target has custom/legal vector inserts and vector masked store lowering):

  vmovdqa	(%rdi), %xmm0
  vpinsrd	$0, %esi, %xmm0, %xmm0
  vpinsrd	$3, %edx, %xmm0, %xmm0
  vmovdqa	%xmm0, (%rdi)

vs.

  vmovd	%esi, %xmm0
  vpinsrd	$3, %edx, %xmm0, %xmm0
  vmovdqa	LCPI1_0(%rip), %xmm1    ## xmm1 = [4294967295,0,0,4294967295]
  vpmaskmovd	%xmm0, %xmm1, (%rdi)

I'm not actually sure which of those we consider better, but neither is ideal. We'd be better off pretending there was no masked move instruction and getting the expanded form:

  movl	%esi, (%rdi)
  movl	%edx, 12(%rdi)


Repository:
  rG LLVM Github Monorepo

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D70223/new/

https://reviews.llvm.org/D70223




