[PATCH] D93192: [SLP] Fix vector element size for the store chains
Anton Afanasyev via Phabricator via llvm-commits
llvm-commits at lists.llvm.org
Fri Dec 18 08:06:04 PST 2020
anton-afanasyev added a comment.
In-short: I'm still do sure that current patch is bugfixing. But this bug had induced some of boundary vectorization cases before fixing. Below is a typical case.
I've dug into the several files of the tests above (for instance, `consumer-typeset/z35.c`, here is details: http://llvm-compile-time-tracker.com/compare.php?from=a6c25539c1ed6b60dc3b92a5abe25f6bdb6e9788&to=283577865a97c26852292afae4fa57c689edbbfb&stat=size-total&details=on) and checked SLP stats:
$ ~/llvm/llvm-project/build_deb_base/bin/clang ... -mllvm --stats -c consumer-typeset/z35.c -o z35.base.o 2>stat.base
$ ~/llvm/llvm-project/build_deb_exp/bin/clang ... -mllvm --stats -c consumer-typeset/z35.c -o z35.exp.o 2>stat.exp
$ size z35.base.o
text data bss dec hex filename
9057 0 56 9113 2399 z43.base.o
$ size z43.exp.o
text data bss dec hex filename
8929 0 56 8985 2319 z43.exp.o
$ grep SLP stat.*
stat.base: 35 SLP - Number of vector instructions generated
So, really code size decreased, but due to the number of vector instrs _not generated_. I've investigated these instructions and also gathered vectorization costs.
Before bugfixing:
%53 = insertelement <4 x %union.rec*> undef, %union.rec* %.in, i32 0
%54 = shufflevector <4 x %union.rec*> %53, <4 x %union.rec*> undef, <4 x i32> zeroinitializer
%55 = bitcast %union.rec* %.in to <4 x %union.rec*>*
store <4 x %union.rec*> %54, <4 x %union.rec*>* %55, align 8, !tbaa !6
After bugfixing:
%53 = getelementptr inbounds %union.rec, %union.rec* %.in, i64 0, i32 0, i32 0, i64 1, i32 1
store %union.rec* %.in, %union.rec** %53, align 8, !tbaa !6
%54 = getelementptr inbounds %union.rec, %union.rec* %.in, i64 0, i32 0, i32 0, i64 1, i32 0
store %union.rec* %.in, %union.rec** %54, align 8, !tbaa !6
%55 = getelementptr inbounds %union.rec, %union.rec* %.in, i64 0, i32 0, i32 0, i64 0, i32 1
store %union.rec* %.in, %union.rec** %55, align 8, !tbaa !6
%56 = getelementptr inbounds %union.rec, %union.rec* %.in, i64 0, i32 0, i32 0, i64 0, i32 0
store %union.rec* %.in, %union.rec** %56, align 8, !tbaa !6
Asm code before:
movq %rax, %xmm0
pshufd $68, %xmm0, %xmm0 # xmm0 = xmm0[0,1,0,1]
movdqu %xmm0, 16(%rax)
movdqu %xmm0, (%rax)
Asm code after:
movq %rax, 24(%rax)
movq %rax, 16(%rax)
movq %rax, 8(%rax)
movq %rax, (%rax)
So actually `<4 x %union.rec*>` doesn't fit to maximum vector register (4x64=256), but store instr could be lowered to two `movdqu`. The total cost of tree is -1, allowing vectorization. But four stores hasn't been vectorized to two `<2 x %union.reg*>` after bugfixing, since cost of both parts is zero. We have a border case when 0 + 0 = -1 here.
Not sure what to do with it (allow `2 * MaxVecRegSize` for llvm vectors of tree?), but current fix is still necessary -- previously bug has exploited this "border case" accidently, for the random cases of combined stores chains.
Repository:
rG LLVM Github Monorepo
CHANGES SINCE LAST ACTION
https://reviews.llvm.org/D93192/new/
https://reviews.llvm.org/D93192
More information about the llvm-commits
mailing list