[PATCH] D93192: [SLP] Fix vector element size for the store chains

Fri Dec 18 08:06:04 PST 2020

anton-afanasyev added a comment.

In-short: I'm still do sure that current patch is bugfixing. But this bug had induced some of boundary vectorization cases before fixing. Below is a typical case.

I've dug into the several files of the tests above (for instance, `consumer-typeset/z35.c`, here is details: http://llvm-compile-time-tracker.com/compare.php?from=a6c25539c1ed6b60dc3b92a5abe25f6bdb6e9788&to=283577865a97c26852292afae4fa57c689edbbfb&stat=size-total&details=on) and checked SLP stats:

  $ ~/llvm/llvm-project/build_deb_base/bin/clang ... -mllvm --stats -c consumer-typeset/z35.c -o z35.base.o 2>stat.base
  $ ~/llvm/llvm-project/build_deb_exp/bin/clang ... -mllvm --stats -c consumer-typeset/z35.c -o z35.exp.o 2>stat.exp
  $ size z35.base.o
     text    data     bss     dec     hex filename
     9057       0      56    9113    2399 z43.base.o
  $ size z43.exp.o
     text    data     bss     dec     hex filename
     8929       0      56    8985    2319 z43.exp.o
  $ grep SLP stat.*                                                                              
  stat.base:   35 SLP                          - Number of vector instructions generated

So, really code size decreased, but due to the number of vector instrs _not generated_.  I've investigated these instructions and also gathered vectorization costs.
Before bugfixing:

  %53 = insertelement <4 x %union.rec*> undef, %union.rec* %.in, i32 0
  %54 = shufflevector <4 x %union.rec*> %53, <4 x %union.rec*> undef, <4 x i32> zeroinitializer
  %55 = bitcast %union.rec* %.in to <4 x %union.rec*>*
  store <4 x %union.rec*> %54, <4 x %union.rec*>* %55, align 8, !tbaa !6

After bugfixing:

  %53 = getelementptr inbounds %union.rec, %union.rec* %.in, i64 0, i32 0, i32 0, i64 1, i32 1
  store %union.rec* %.in, %union.rec** %53, align 8, !tbaa !6
  %54 = getelementptr inbounds %union.rec, %union.rec* %.in, i64 0, i32 0, i32 0, i64 1, i32 0
  store %union.rec* %.in, %union.rec** %54, align 8, !tbaa !6
  %55 = getelementptr inbounds %union.rec, %union.rec* %.in, i64 0, i32 0, i32 0, i64 0, i32 1
  store %union.rec* %.in, %union.rec** %55, align 8, !tbaa !6
  %56 = getelementptr inbounds %union.rec, %union.rec* %.in, i64 0, i32 0, i32 0, i64 0, i32 0
  store %union.rec* %.in, %union.rec** %56, align 8, !tbaa !6

Asm code before:

  movq    %rax, %xmm0
  pshufd  $68, %xmm0, %xmm0               # xmm0 = xmm0[0,1,0,1]
  movdqu  %xmm0, 16(%rax)
  movdqu  %xmm0, (%rax)

Asm code after:

  movq    %rax, 24(%rax)
  movq    %rax, 16(%rax)
  movq    %rax, 8(%rax)
  movq    %rax, (%rax)

So actually `<4 x %union.rec*>` doesn't fit to maximum vector register (4x64=256), but store instr could be lowered to two `movdqu`. The total cost of tree is -1, allowing vectorization. But four stores hasn't been vectorized to two `<2 x %union.reg*>` after bugfixing, since cost of both parts is zero. We have a border case when 0 + 0 = -1 here.

Not sure what to do with it (allow `2 * MaxVecRegSize` for llvm vectors of tree?), but current fix is still necessary -- previously bug has exploited this "border case" accidently, for the random cases of combined stores chains.

Repository:
  rG LLVM Github Monorepo

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D93192/new/

https://reviews.llvm.org/D93192