[PATCH] D74088: [x86] form broadcast of scalar memop even with >1 use

Thu Feb 6 11:56:17 PST 2020

spatel marked an inline comment as done.
spatel added inline comments.

================
Comment at: llvm/test/CodeGen/X86/extract-concat.ll:143-147
   %x = load <4 x i64>, <4 x i64>* %p
   %cat1 = shufflevector <4 x i64> %x, <4 x i64> undef, <8 x i32> <i32 0, i32 1, i32 2, i32 3, i32 0, i32 1, i32 2, i32 3>
   %cat2 = shufflevector <8 x i64> %cat1, <8 x i64> undef, <16 x i32> <i32 0, i32 1, i32 2, i32 3, i32 4, i32 5, i32 6, i32 7, i32 0, i32 1, i32 2, i32 3, i32 4, i32 5, i32 6, i32 7>
   %r = shufflevector <16 x i64> %cat2, <16 x i64> undef, <16 x i32> <i32 0, i32 4, i32 8, i32 12, i32 1, i32 5, i32 9, i32 13, i32 2, i32 6, i32 10, i32 14, i32 3, i32 7, i32 11, i32 15>
   ret  <16 x i64> %r
----------------
I looked closer at this example to see how AVX1 already gets the broadcasts - it's just luck and/or broken logic. We handle use-checks inconsistently.

The test uses i64 elements, but v4i64 shuffles get legalized to v4f64 with AVX1 by bitcasting the load.

So we clearly start out with a load that has multiple uses:
  t5: v4i64,ch = load<(load 32 from %ir.p)> t0, t2, undef:i64
    t40: v4i64 = vector_shuffle<0,0,0,0> t5, undef:v4i64
    t41: v4i64 = vector_shuffle<1,1,1,1> t5, undef:v4i64
  ...

After the bitcast is added, it is the *bitcast* node that subsequently has >1 use. But when we lower the shuffle, we peek through bitcasts and find that the load itself has only the one bitcast user, so the use check passes.
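
To make the inconsistency concrete, here is a rough toy model of the situation, not the actual SelectionDAG code. `Node`, `peek_through_bitcasts`, and `load_looks_single_use` are hypothetical names for illustration only, and the four splat shuffles per load are an assumption based on the test above.

```python
class Node:
    """Toy stand-in for a DAG node (hypothetical, not LLVM's SDNode)."""

    def __init__(self, opcode, *operands):
        self.opcode = opcode
        self.operands = list(operands)
        self.users = []
        # Register this node as a user of each of its operands.
        for op in operands:
            op.users.append(self)

    def has_one_use(self):
        return len(self.users) == 1


def peek_through_bitcasts(node):
    """Walk up through any chain of bitcasts to the underlying node."""
    while node.opcode == "bitcast":
        node = node.operands[0]
    return node


def load_looks_single_use(shuffle):
    """Use check applied *after* peeking through bitcasts (the flawed order)."""
    src = peek_through_bitcasts(shuffle.operands[0])
    return src.opcode == "load" and src.has_one_use()


# Integer case after legalization: load -> bitcast -> 4 splat shuffles.
load_i = Node("load")
cast = Node("bitcast", load_i)
shuf_i = [Node("vector_shuffle", cast) for _ in range(4)]

# FP case: no bitcast is needed, so the shuffles use the load directly.
load_f = Node("load")
shuf_f = [Node("vector_shuffle", load_f) for _ in range(4)]

int_result = [load_looks_single_use(s) for s in shuf_i]
fp_result = [load_looks_single_use(s) for s in shuf_f]
print(int_result, fp_result)
```

In this model the integer case passes the check for every shuffle because the load's only direct user is the bitcast, while the FP case fails it because the load has four direct users.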

If we modify this test to use <n x double> types, we get much worse codegen for AVX1:

```
	vmovapd	(%rdi), %ymm0
	vmovddup	%ymm0, %ymm1    ## ymm1 = ymm0[0,0,2,2]
	vperm2f128	$17, %ymm0, %ymm1, %ymm2 ## ymm2 = ymm1[2,3,2,3]
	vpermilpd	$15, %ymm0, %ymm0 ## ymm0 = ymm0[1,1,3,3]
	vperm2f128	$17, %ymm0, %ymm0, %ymm3 ## ymm3 = ymm0[2,3,2,3]
	vmovapd	(%rdi), %xmm1
	vmovddup	%xmm1, %xmm0    ## xmm0 = xmm1[0,0]
	vinsertf128	$1, %xmm0, %ymm0, %ymm0
	vpermilpd	$3, %xmm1, %xmm1 ## xmm1 = xmm1[1,1]
	vinsertf128	$1, %xmm1, %ymm1, %ymm1

```

But this patch would solve that problem by eliminating the use check - we'd get the ideal 4 broadcast instructions regardless of whether the element type is float or int.

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D74088/new/

https://reviews.llvm.org/D74088