[llvm-bugs] [Bug 42070] New: SLP Vectorizer fails to vectorize a horizontal pattern when it's repeated

Thu May 30 06:11:55 PDT 2019

https://bugs.llvm.org/show_bug.cgi?id=42070

            Bug ID: 42070
           Summary: SLP Vectorizer fails to vectorize a horizontal pattern
                    when it's repeated
           Product: libraries
           Version: trunk
          Hardware: All
                OS: All
            Status: NEW
          Severity: enhancement
          Priority: P
         Component: Scalar Optimizations
          Assignee: unassignedbugs at nondot.org
          Reporter: flashmozzg at gmail.com
                CC: llvm-bugs at lists.llvm.org

Seems to be loosely related to https://bugs.llvm.org/show_bug.cgi?id=35448
since the problematic pattern is often a result of loop unrolling.

For some reason, the SLP vectorizer fails to vectorize a horizontal reduction
pattern when it is repeated. For example, the following code:

float foo(float * __restrict x, float * __restrict y, unsigned len) {
    float acc = 0;
    acc += *x++ * *y++;
    acc += *x++ * *y++;
    x += 4; y += 4;
    acc += *x++ * *y++;
    acc += *x++ * *y++;
    return acc;
}

is compiled into:

define dso_local float @foo(float* noalias nocapture readonly, float* noalias nocapture readonly, i32) local_unnamed_addr #0 {
  %4 = getelementptr inbounds float, float* %0, i64 1
  %5 = load float, float* %0, align 4, !tbaa !2
  %6 = getelementptr inbounds float, float* %1, i64 1
  %7 = load float, float* %1, align 4, !tbaa !2
  %8 = fmul float %5, %7
  %9 = fadd float %8, 0.000000e+00
  %10 = load float, float* %4, align 4, !tbaa !2
  %11 = load float, float* %6, align 4, !tbaa !2
  %12 = fmul float %10, %11
  %13 = fadd float %9, %12
  %14 = getelementptr inbounds float, float* %0, i64 6
  %15 = getelementptr inbounds float, float* %1, i64 6
  %16 = bitcast float* %14 to <2 x float>*
  %17 = load <2 x float>, <2 x float>* %16, align 4, !tbaa !2
  %18 = bitcast float* %15 to <2 x float>*
  %19 = load <2 x float>, <2 x float>* %18, align 4, !tbaa !2
  %20 = fmul <2 x float> %17, %19
  %21 = extractelement <2 x float> %20, i32 0
  %22 = fadd float %13, %21
  %23 = extractelement <2 x float> %20, i32 1
  %24 = fadd float %22, %23
  ret float %24
}

Note that only the second half (after x += 4; y += 4) was vectorized, although
each half can be vectorized separately just fine. It looks like the SLP vectorizer
initially attempts to reduce all loads and adds, fails because of the middle
increment, and then never tries to vectorize the first half.
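
To make "each half separately" concrete, here is a minimal sketch that splits
foo() into its two halves. The names first_half and second_half are
hypothetical and not part of the original report; the indices 0, 1 and 6, 7
follow from the pointer increments in foo() and match the "i64 6" offsets in
the IR above. Whether a given compiler version vectorizes either function is
something to check locally.

```c
/* Hypothetical split of foo() from the report: each half in isolation. */

/* First pair of products: elements 0 and 1. */
float first_half(const float *__restrict x, const float *__restrict y) {
    float acc = 0;
    acc += x[0] * y[0];
    acc += x[1] * y[1];
    return acc;
}

/* Second pair of products: elements 6 and 7 (after x += 4; y += 4). */
float second_half(const float *__restrict x, const float *__restrict y) {
    float acc = 0;
    acc += x[6] * y[6];
    acc += x[7] * y[7];
    return acc;
}
```

Note that summing the two results reassociates the floating-point additions
relative to foo(), so it is not guaranteed to be bit-identical in general.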

This can have a significant effect on performance in the presence of loop
unrolling.
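
As an illustration of how unrolling could produce this shape, the sketch below
shows a loop whose body, unrolled twice, performs the same loads and adds as
foo(). The function foo_loop and its parameter n are assumptions made for this
example; the report does not include the original loop.

```c
/* Hypothetical loop (not from the report): two products per group,
   with groups spaced six elements apart. For n == 2 this reads
   elements 0, 1, 6 and 7, the same ones foo() reads. */
float foo_loop(const float *__restrict x, const float *__restrict y,
               unsigned n) {
    float acc = 0;
    for (unsigned i = 0; i < n; ++i) {
        acc += x[6 * i] * y[6 * i];
        acc += x[6 * i + 1] * y[6 * i + 1];
    }
    return acc;
}
```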
