[PATCH] D73575: [InstCombine] canonicalize splat shuffle after cmp

Fri Nov 12 09:13:47 PST 2021

spatel added inline comments.

================
Comment at: llvm/test/Transforms/InstCombine/gep-inbounds-null.ll:95
+; CHECK-NEXT:    [[TMP0:%.*]] = icmp eq <2 x i8*> [[DOTSPLATINSERT]], zeroinitializer
+; CHECK-NEXT:    [[CND:%.*]] = shufflevector <2 x i1> [[TMP0]], <2 x i1> undef, <2 x i32> zeroinitializer
 ; CHECK-NEXT:    ret <2 x i1> [[CND]]
----------------
LuoYuanke wrote:
> spatel wrote:
> > LuoYuanke wrote:
> > > On X86 the result of the vector compare instruction would be in %k register, but there is no shuffle instruction for %k register. Here is the test case that was regressed due to this patch. We can duplicate it with "llc -mcpu=skylake-avx512". Any idea to improve it?
> > > 
> > > ```
> > > define void @before_canonicalization_i1(<16 x i1> %msk, i32 %in, <16 x i1>* %dst) {
> > > entry:
> > >   %insrt = insertelement <16 x i32> undef, i32 %in, i32 0
> > >   %splat = shufflevector <16 x i32> %insrt, <16 x i32> poison, <16 x i32> zeroinitializer
> > >   %mul = mul <16 x i32> <i32 789, i32 789, i32 789, i32 789, i32 789, i32 789, i32 789, i32 789, i32 789, i32 789, i32 789, i32 789, i32 789, i32 789, i32 789, i32 789>, %splat
> > >   %cmp1 = icmp eq <16 x i32> %mul, <i32 0, i32 0, i32 0, i32 0, i32 0, i32 0, i32 0, i32 0, i32 0, i32 0, i32 0, i32 0, i32 0, i32 0, i32 0, i32 0>
> > >   %and = and <16 x i1> %msk, %cmp1
> > >   %bc = bitcast <16 x i1> %and to i16
> > >   %cmp = icmp ne i16 %bc, 0
> > >   br i1 %cmp, label %b1, label %b2
> > > b1:
> > >   store <16 x i1> %and, <16 x i1>* %dst, align 8
> > >   ret void
> > > b2:
> > >   store <16 x i1> %cmp1, <16 x i1>* %dst, align 8
> > >   ret void
> > > }
> > > 
> > > ```
> > > 
> > > 
> > > ```
> > > define void @after_canonicalization_i1(<16 x i1> %msk, i32 %in, <16 x i1>* %dst) {
> > > entry:
> > >   %insrt = insertelement <16 x i32> undef, i32 %in, i32 0
> > >   %0 = mul <16 x i32> %insrt, <i32 789, i32 poison, i32 poison, i32 poison, i32 poison, i32 poison, i32 poison, i32 poison, i32 poison, i32 poison, i32 poison, i32 poison, i32 poison, i32 poison, i32 poison, i32 poison>
> > >   %1 = icmp eq <16 x i32> %0, zeroinitializer
> > >   %cmp1 = shufflevector <16 x i1> %1, <16 x i1> poison, <16 x i32> zeroinitializer
> > >   %and = and <16 x i1> %cmp1, %msk
> > >   %bc = bitcast <16 x i1> %and to i16
> > >   %cmp.not = icmp eq i16 %bc, 0
> > >   br i1 %cmp.not, label %b2, label %b1
> > > b1:
> > >   store <16 x i1> %and, <16 x i1>* %dst, align 8
> > >   ret void
> > > b2:
> > >   store <16 x i1> %cmp1, <16 x i1>* %dst, align 8
> > >   ret void
> > > }
> > > ```
> > > 
> > > 
> > > 
> > This example (not sure if this was over-reduced from original code...) shows a few potential missed folds. I'm not sure yet why we failed all of them. 
> > 
> > Reduced the vector length for readability:
> > 
> > ```
> > define <2 x i1> @src(i32 %in) { 
> >   %insrt = insertelement <2 x i32> undef, i32 %in, i32 0
> >   %m = mul <2 x i32> %insrt, <i32 789, i32 poison>
> >   %r = icmp eq <2 x i32> %m, zeroinitializer
> >   ret <2 x i1> %r
> > }
> > 
> > ```
> > 1. Scalarize the mul?
> > 2. Followed by scalarize the icmp?
> > 3. We also missed eliminating the mul - probably because we didn't recognize the constant with poison.
> > https://alive2.llvm.org/ce/z/TZyTU2
> I mean we should avoid shuffle <X x i1> after icmp with AVX512 enabled, because there is no shuffle instruction for k register. Take below code as example.
> 
> ```
> cat shufvXi32.ll
> 
> define <16 x i1> @shuffle(<16 x i1> %msk, i32 %in) {
> entry:
>   %insrt = insertelement <16 x i32> undef, i32 %in, i32 0
>   %splat = shufflevector <16 x i32> %insrt, <16 x i32> poison, <16 x i32> zeroinitializer
>   %mul = mul <16 x i32> <i32 789, i32 789, i32 789, i32 789, i32 789, i32 789, i32 789, i32 789, i32 789, i32 789, i32 789, i32 789, i32 789, i32 789, i32 789, i32 789>, %splat
>   %cmp1 = icmp eq <16 x i32> %mul, <i32 0, i32 0, i32 0, i32 0, i32 0, i32 0, i32 0, i32 0, i32 0, i32 0, i32 0, i32 0, i32 0, i32 0, i32 0, i32 0>
>   %and = and <16 x i1> %msk, %cmp1
>   ret <16 x i1> %and
> }
> ```
> opt -S < shufvxi32.ll -instcombine -o shufvXi1.ll
> 
> We get below transformed code.
> ```
> define <16 x i1> @shuffle(<16 x i1> %msk, i32 %in) {
> entry:
>   %insrt = insertelement <16 x i32> undef, i32 %in, i32 0
>   %0 = mul <16 x i32> %insrt, <i32 789, i32 poison, i32 poison, i32 poison, i32 poison, i32 poison, i32 poison, i32 poison, i32 poison, i32 poison, i32 poison, i32 poison, i32 poison, i32 poison, i32 poison, i32 poison>
>   %1 = icmp eq <16 x i32> %0, zeroinitializer
>   %cmp1 = shufflevector <16 x i1> %1, <16 x i1> poison, <16 x i32> zeroinitializer
>   %and = and <16 x i1> %cmp1, %msk
>   ret <16 x i1> %and
> }
> 
> ```
> llc -mcpu=skylake-avx512 shufvXi32.l
> We got below assembly
> 
> ```
> # %bb.0:                                # %entry
>         vpsllw  $7, %xmm0, %xmm0
>         vpmovb2m        %xmm0, %k1
>         vpbroadcastd    %edi, %zmm0
>         vpmulld .LCPI0_0(%rip){1to16}, %zmm0, %zmm0
>         vptestnmd       %zmm0, %zmm0, %k0 {%k1}
>         vpmovm2b        %k0, %xmm0
>         vzeroupper
>         retq
> 
> ```
> llc -mcpu=skylake-avx512 shufvXi1.ll
> We got below assembly.
> 
> ```
> # %bb.0:                                # %entry
>         vpsllw  $7, %xmm0, %xmm0
>         vpxor   %xmm1, %xmm1, %xmm1
>         vmovd   %edi, %xmm2
>         movl    $789, %eax                      # imm = 0x315
>         vmovd   %eax, %xmm3
>         vpmulld %xmm3, %xmm2, %xmm2
>         vptestnmd       %zmm2, %zmm2, %k0
>         vpmovm2w        %k0, %ymm2
>         vpbroadcastw    %xmm2, %ymm2
>         vpmovw2m        %ymm2, %k1
>         vpcmpgtb        %xmm0, %xmm1, %k0 {%k1}
>         vpmovm2b        %k0, %xmm0
>         vzeroupper
>         retq
> 
> ```
> You can see there is more instruction generated for shufvXi1.ll.
We can't really avoid the transform based on target constraints - it's a canonicalization, so it's up to some later pass to invert it if necessary.

I think the mul and maybe even the icmp are distractions from the real problem in these examples.

There's a question of should we transform a splat of bool to sext in IR:
https://alive2.llvm.org/ce/z/ra54Xe

But that doesn't appear to change codegen, so I think there needs to be backend fix that does that transform (or the inverse?). Is there a bug report for this example? If not, can you file it? Thanks!

Repository:
  rG LLVM Github Monorepo

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D73575/new/

https://reviews.llvm.org/D73575