[PATCH] D32993: DAGCombine: Extend createBuildVecShuffle for case len(in_vec) = 4*len(result_vec)

Zvi Rackover via Phabricator via llvm-commits llvm-commits at lists.llvm.org
Thu May 18 14:28:37 PDT 2017


zvi added inline comments.


================
Comment at: test/CodeGen/ARM/vpadd.ll:376
 ; CHECK-NEXT:    vmovl.u8 q8, d16
-; CHECK-NEXT:    vpadd.i16 d16, d16, d17
+; CHECK-NEXT:    vuzp.16 q8, q9
+; CHECK-NEXT:    vadd.i16 d16, d16, d18
----------------
efriedma wrote:
> zvi wrote:
> > zvi wrote:
> > > zvi wrote:
> > > > efriedma wrote:
> > > > > zvi wrote:
> > > > > > This appears to be a regression for ARM codegen. Assuming it is, what are the options for fixing it? IMHO these are the options, ordered by preference:
> > > > > > 1. Can we improve the ARM backend to handle this case?
> > > > > > 2. Add a TLI hook for deciding when insert-extract sequences are better than a composed shuffle?
> > > > > > 3. Do this only in the X86 lowering.
> > > > > We have a combine in the ARM backend which specifically combines vuzp+vadd to vpadd.  It looks like the reason it isn't triggering here is that we're doing the vuzp in the wrong width; probably easy to fix.
> > > > Thanks for highlighting the problem, Eli.
> > > > 
> > > > The following case shows the same missed combine opportunity without this patch: it is lowered to the same asm code as the right-hand side of the diff.
> > > > 
> > > > ```
> > > >  define void @test(<16 x i8> *%cbcr, <4 x i16> *%X) nounwind ssp {
> > > >    %tmp = load <16 x i8>, <16 x i8>* %cbcr
> > > >    %tmp1 = zext <16 x i8> %tmp to <16 x i16>
> > > >    %tmp2 = shufflevector <16 x i16> %tmp1, <16 x i16> undef, <8 x i32> <i32 0, i32 2, i32 4, i32 6, i32 undef, i32 undef, i32 undef, i32 undef>
> > > >    %tmp2a = shufflevector <8 x i16> %tmp2, <8 x i16> undef, <4 x i32> <i32 0, i32 1, i32 2, i32 3>
> > > >    %tmp3 = shufflevector <16 x i16> %tmp1, <16 x i16> undef, <8 x i32> <i32 1, i32 3, i32 5, i32 7, i32 undef, i32 undef, i32 undef, i32 undef>
> > > >    %tmp3a = shufflevector <8 x i16> %tmp3, <8 x i16> undef, <4 x i32> <i32 0, i32 1, i32 2, i32 3>
> > > >    %add = add <4 x i16> %tmp2a, %tmp3a
> > > >    store <4 x i16> %add, <4 x i16>* %X, align 8
> > > >    ret void
> > > >  }
> > > > 
> > > > ```
> > > > 
> > > > 
> > > > 
> > > Created [[ https://bugs.llvm.org/show_bug.cgi?id=32999 | Bug 32999 ]] to track this.
> > I just want to understand what is needed for this review to proceed. Does the ARM regression need to be fixed first, or are we ok with letting this patch in, assuming the regression is easy to fix and will be fixed shortly after?
> Taking another look, I'm not convinced we should just put this issue aside.  It's not really an ARM-specific issue: we're deciding to create an EXTRACT_SUBVECTOR from a 128-bit shuffle rather than just creating a 64-bit shuffle, which might be more efficient depending on the hardware.  Equivalently, on x86 hardware, this is like creating a ymm shuffle when an xmm shuffle would be sufficient.
> 
> I mean, this issue isn't really important enough to block this change, but I'd like to make sure we at least understand what we're doing.
I see your point. Going back to the list of options:
1. Conservatively bail out if (min_mask_index * 2 > NumElems || max_mask_index * 2 < NumElems), which means we are only accessing elements from one half of the input vector (see the sketch after this list)
2. Add a TLI hook to let the target decide if the large shuffle is ok
3. Always allow creation of large shuffles (what the current patch does)
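To make option 1 concrete, here is a rough sketch of the kind of check I have in mind. The helper name, its signature, and the way the combined mask and element count are passed in are made up for illustration; this is not the existing createBuildVecShuffle code.

```
// Hypothetical helper: returns true if every defined index in the wide
// shuffle mask falls within a single half of the wide input vector, in
// which case a narrower shuffle would have been sufficient and we could
// conservatively bail out of building the wide one.
#include "llvm/ADT/ArrayRef.h"
#include <algorithm>

static bool maskUsesOnlyOneHalf(llvm::ArrayRef<int> Mask, unsigned NumElems) {
  int MinIdx = (int)NumElems, MaxIdx = -1;
  for (int Idx : Mask) {
    if (Idx < 0)
      continue; // undef lane, ignore
    MinIdx = std::min(MinIdx, Idx);
    MaxIdx = std::max(MaxIdx, Idx);
  }
  if (MaxIdx < 0)
    return true; // all lanes undef
  // All used indices are in the low half, or all are in the high half.
  return (unsigned)(MaxIdx * 2) < NumElems ||
         (unsigned)(MinIdx * 2) >= NumElems;
}
```

With a check like this, the combine could fall back to extracting the used half and building a narrower shuffle instead of the wide one, which would sidestep the ARM case above.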


https://reviews.llvm.org/D32993




