[PATCH] Make SLP vectorizer consider the cost that vectorized instruction cannot use memory operand as destination on X86

Mon Jun 15 08:46:08 PDT 2015

I did some more analysis for the target benchmark.

Previously, I found that disabling SLP vectorization improves the
benchmark by 10% on Sandy Bridge and 6% on Westmere, all because of
fewer reservation-station-full resource stalls according to the perf
stat results. The extra reservation-station-full events are usually
caused by a longer dependence chain: after vectorization, there is
one more instruction on the dependence chain to extract a scalar
element from the vector for external uses. According to my experiments
tweaking the testcases, the perf impact of such a minor dependence
chain increase depends heavily on program context and
microarchitecture, and is very unpredictable. I feel it is difficult
to say whether or not we should vectorize based on those
perf metrics.

However, another thing I found interesting is that, for the benchmark, SLP
vectorization only partially vectorized the key function. I fully
vectorized the key function by hand and found that performance improved
even more than with SLP vectorization disabled, at least on Sandy Bridge
(14% on Sandy Bridge and 4% on Westmere). The resource-full stalls were
almost the same as in the version generated by existing trunk, but total
uops dropped a lot. This is good because it matches the common sense
that more successful vectorization can improve performance (although
on Westmere it is still 2% worse than disabling SLP vectorization).

I use a small case here to show why SLP vectorization cannot fully
vectorize the function, and I have a simple patch to solve the problem.

The testcase 1.c:
unsigned long total;

void foo(unsigned long *a, long i) {
  a[0] >>= 4;                 // storeinst1
  a[1] >>= 4;                 // storeinst2
  total += a[i];
  a[0] >>= 4;                 // storeinst3
  a[1] >>= 4;                 // storeinst4
  total += a[i];
}

Our target benchmark has a similar pattern to 1.c, but repeated multiple times:
void foo(unsigned long *a, long i) {
  a[0] >>= 4;
  a[1] >>= 4;
  total += a[i];
  a[0] >>= 4;
  a[1] >>= 4;
  total += a[i];
  ...
  a[0] >>= 4;
  a[1] >>= 4;
  total += a[i];
}

~/workarea/llvm-r238437/build/bin/clang -O2 -S  1.c
        shrq    $4, (%rdi)
        shrq    $4, 8(%rdi)                    // storeinst1 and storeinst2 are not slp vectorized.
        movq    (%rdi,%rsi,8), %rax
        addq    %rax, total(%rip)
        movdqu  (%rdi), %xmm0
        psrlq   $4, %xmm0                   // storeinst3 and storeinst4 are slp vectorized.
        movdqu  %xmm0, (%rdi)
        movq    (%rdi,%rsi,8), %rax
        addq    %rax, total(%rip)
        retq
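
For illustration, a fully vectorized form of 1.c could look roughly like
the sketch below. It is not the hand-vectorized benchmark code measured
above. The names foo_scalar, foo_vec, shift_pair, check_foo_vec, total_ref
and total_vec are made up for this example, and std::uint64_t stands in
for unsigned long on the assumption that it is 64 bits wide. The vector is
reloaded after each "total += a[i]" because a[] may alias total, as in the
compiler output above.

```cpp
#include <cassert>
#include <cstdint>
#if defined(__SSE2__)
#include <emmintrin.h>  // SSE2 intrinsics
#endif

// Hypothetical names; uint64_t stands in for the 64-bit unsigned long of 1.c.
std::uint64_t total_ref = 0;
std::uint64_t total_vec = 0;

// Scalar reference with the same statements as 1.c.
void foo_scalar(std::uint64_t *a, long i) {
  a[0] >>= 4;
  a[1] >>= 4;
  total_ref += a[i];
  a[0] >>= 4;
  a[1] >>= 4;
  total_ref += a[i];
}

// Shift a[0] and a[1] right by 4 as one 128-bit operation when SSE2 is
// available, otherwise fall back to two scalar shifts.
static void shift_pair(std::uint64_t *a) {
#if defined(__SSE2__)
  __m128i v = _mm_loadu_si128(reinterpret_cast<const __m128i *>(a));  // movdqu load
  v = _mm_srli_epi64(v, 4);                                           // psrlq $4
  _mm_storeu_si128(reinterpret_cast<__m128i *>(a), v);                // movdqu store
#else
  a[0] >>= 4;
  a[1] >>= 4;
#endif
}

// Both store pairs of 1.c handled as vector operations.
void foo_vec(std::uint64_t *a, long i) {
  shift_pair(a);
  total_vec += a[i];
  shift_pair(a);
  total_vec += a[i];
}

// Usage: the two versions must agree on the array and on the running total.
void check_foo_vec() {
  std::uint64_t x[2] = {0x1234567890ABCDEFull, 0xFEDCBA0987654321ull};
  std::uint64_t y[2] = {x[0], x[1]};
  total_ref = 0;
  total_vec = 0;
  foo_scalar(x, 1);
  foo_vec(y, 1);
  assert(x[0] == y[0] && x[1] == y[1]);
  assert(total_ref == total_vec);
}
```

Whether this form is actually faster depends on the microarchitecture, as
the numbers above show.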

In 1.c, storeinst1 has two consecutive store candidates: storeinst2 and
storeinst4. SLPVectorizer::vectorizeStores puts only the last one, i.e.
storeinst4, in ConsecutiveChain: ConsecutiveChain[storeinst1] = storeinst4.
But storeinst1 and storeinst4 cannot be SLP vectorized because of a
memory dependency. storeinst1 and storeinst2 could be vectorized, but
are not, because ConsecutiveChain[] does not regard them as
consecutive.

In SLPVectorizer::vectorizeStores, the SLP vectorizer doesn't search all
the possible consecutive store instruction pairs, which misses some
opportunities but reduces the search space significantly. For common
programs, if a store instruction has multiple consecutive store
candidates, the immediately succeeding or preceding one is usually the
best candidate, rather than the last one seen in the Stores set.

The patch changes the iteration order in
SLPVectorizer::vectorizeStores to search for the best candidate for
ConsecutiveChain[]. I will post the patch in a separate differential
revision.
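
To make the difference concrete, here is a small standalone model of the
two selection strategies. It is only a sketch, not the LLVM code or the
patch: Store, chainLastSeen, chainNearest and storesOf1c are hypothetical
names, a store is reduced to its element offset from a common base, and
breaking ties in favor of the succeeding store is an assumption made for
this example.

```cpp
#include <cassert>
#include <cstdlib>
#include <map>
#include <vector>

// Hypothetical stand-in for a store instruction: its index in the vector is
// its position in the block, offset is the element offset from the base.
struct Store {
  int offset;
};

// True if store j writes the element right after the one store i writes.
static bool isConsecutive(const std::vector<Store> &s, int i, int j) {
  return i != j && s[j].offset == s[i].offset + 1;
}

// "Last seen wins": every later match overwrites the earlier one.
std::map<int, int> chainLastSeen(const std::vector<Store> &s) {
  std::map<int, int> chain;
  const int n = static_cast<int>(s.size());
  for (int i = 0; i < n; ++i)
    for (int j = 0; j < n; ++j)
      if (isConsecutive(s, i, j))
        chain[i] = j;
  return chain;
}

// "Nearest wins": keep the match closest to store i in program order.
// On a tie the later (succeeding) store is kept, because j ascends and the
// comparison is <=.
std::map<int, int> chainNearest(const std::vector<Store> &s) {
  std::map<int, int> chain;
  const int n = static_cast<int>(s.size());
  for (int i = 0; i < n; ++i) {
    int best = -1;
    for (int j = 0; j < n; ++j)
      if (isConsecutive(s, i, j) &&
          (best < 0 || std::abs(j - i) <= std::abs(best - i)))
        best = j;
    if (best >= 0)
      chain[i] = best;
  }
  return chain;
}

// The four stores of 1.c: storeinst1..storeinst4 at indices 0..3.
std::vector<Store> storesOf1c() {
  return {{0}, {1}, {0}, {1}};
}
```

With the stores of 1.c, chainLastSeen maps storeinst1 (index 0) to
storeinst4 (index 3), while chainNearest maps it to storeinst2 (index 1)
and maps storeinst3 (index 2) to storeinst4 (index 3).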

Thanks,
Wei.

On Fri, Jun 12, 2015 at 9:00 AM, Xinliang David Li <davidxl at google.com> wrote:
> On Thu, Jun 11, 2015 at 10:29 PM, Nadav Rotem <nrotem at apple.com> wrote:
>>
>>> On Jun 11, 2015, at 9:25 AM, Wei Mi <wmi at google.com> wrote:
>>>
>>> From the target benchmark, for vectorized version, uops count is
>>> decreased but the count of reservation station full event is
>>> increased. I can create another small testcase and the vectorized
>>> version is better.
>>
>
>> It sounds like the current cost model is correct.
>
> More precisely speaking -- the current cost model is correct sometimes
> while not so in other situations.
>
>> Keep in mind that we are developing a compiler that compiles many different programs, not only one benchmark. I would be disappointed If we change the heuristics or cost model just to fix a specific workload without understanding the change and without having a good justification.
>
> Totally agree.