[PATCH] D32993: DAGCombine: Extend createBuildVecShuffle for case len(in_vec) = 4*len(result_vec)
Zvi Rackover via Phabricator via llvm-commits
llvm-commits at lists.llvm.org
Wed May 10 11:53:47 PDT 2017
zvi added inline comments.
================
Comment at: test/CodeGen/ARM/vpadd.ll:376
; CHECK-NEXT: vmovl.u8 q8, d16
-; CHECK-NEXT: vpadd.i16 d16, d16, d17
+; CHECK-NEXT: vuzp.16 q8, q9
+; CHECK-NEXT: vadd.i16 d16, d16, d18
----------------
efriedma wrote:
> zvi wrote:
> > This appears to be a regression for ARM codegen. Assuming it is, what the options for fixing it? IMHO these are the options ordered by preference:
> > 1. Can we improve the ARM backend to handle this case?
> > 2. Add a TLI hook for deciding when insert-extract sequences are better than composed shuffle?
> > 3. Do this only in the X86 lowering.
> We have a combine in the ARM backend which specifically combines vuzp+vadd to vpadd. It looks like the reason it isn't triggering here is that we're doing the vuzp in the wrong width; probably easy to fix.
Thanks for highlighting the problem, Eli.
The following case shows the same missed combine opportunity without this patch, by being lowered to the same asm code as the right-hand side of the diff.
```
define void @test(<16 x i8> *%cbcr, <4 x i16> *%X) nounwind ssp {
%tmp = load <16 x i8>, <16 x i8>* %cbcr
%tmp1 = zext <16 x i8> %tmp to <16 x i16>
%tmp2 = shufflevector <16 x i16> %tmp1, <16 x i16> undef, <8 x i32> <i32 0, i32 2, i32 4, i32 6, i32 undef, i32 undef, i32 undef, i32 undef>
%tmp2a = shufflevector <8 x i16> %tmp2, <8 x i16> undef, <4 x i32> <i32 0, i32 1, i32 2, i32 3>
%tmp3 = shufflevector <16 x i16> %tmp1, <16 x i16> undef, <8 x i32> <i32 1, i32 3, i32 5, i32 7, i32 undef, i32 undef, i32 undef, i32 undef>
%tmp3a = shufflevector <8 x i16> %tmp3, <8 x i16> undef, <4 x i32> <i32 0, i32 1, i32 2, i32 3>
%add = add <4 x i16> %tmp2a, %tmp3a
store <4 x i16> %add, <4 x i16>* %X, align 8
ret void
}
```
https://reviews.llvm.org/D32993
More information about the llvm-commits
mailing list