[Patch] [SLPVectorizer] Vectorize Horizontal Reductions from Consecutive Loads

Chandler Carruth chandlerc at google.com
Wed Dec 17 00:58:53 PST 2014


This patch is deeply flawed -- it reassociates floating-point
arithmetic in non-fastmath code.

Consider this example which nicely demonstrates why reassociation is
problematic:

% cat hadd_bad.ll
; ModuleID = 'hadd_bad.c'
target datalayout = "e-m:e-i64:64-f80:128-n8:16:32:64-S128"
target triple = "x86_64-unknown-linux-gnu"

@.str = private unnamed_addr constant [32 x i8] c"a = %f, b = %f, c = %f, d = %f\0A\00", align 1
@.str1 = private unnamed_addr constant [4 x i8] c"%f\0A\00", align 1

define float @hadd(float* nocapture noalias %a.0) {
entry:
  %a.0.v = load float* %a.0
  %a.1 = getelementptr float* %a.0, i32 1
  %a.1.v = load float* %a.1
  %add = fadd float %a.0.v, %a.1.v
  %a.2 = getelementptr float* %a.0, i32 2
  %a.2.v = load float* %a.2
  %a.3 = getelementptr float* %a.0, i32 3
  %a.3.v = load float* %a.3
  %add1 = fadd float %a.2.v, %a.3.v
  %add2 = fadd float %add, %add1
  ret float %add2
}

define i32 @main() {
entry:
  %a = alloca [4 x float]
  %a.0 = getelementptr [4 x float]* %a, i32 0, i32 0
  %a.1 = getelementptr [4 x float]* %a, i32 0, i32 1
  %a.2 = getelementptr [4 x float]* %a, i32 0, i32 2
  %a.3 = getelementptr [4 x float]* %a, i32 0, i32 3
  store float 0x47EFFFFFE0000000, float* %a.0
  store float 0xC7EFFFFFE0000000, float* %a.1
  store float 0x3F50624DE0000000, float* %a.2
  store float 0.000000e+00, float* %a.3
  %a.0.v = load float* %a.0
  %a.1.v = load float* %a.1
  %a.2.v = load float* %a.2
  %a.3.v = load float* %a.3
  %a.0.d = fpext float %a.0.v to double
  %a.1.d = fpext float %a.1.v to double
  %a.2.d = fpext float %a.2.v to double
  %a.3.d = fpext float %a.3.v to double
  %call = tail call i32 (i8*, ...)* @printf(i8* getelementptr inbounds ([32 x i8]* @.str, i64 0, i64 0), double %a.0.d, double %a.1.d, double %a.2.d, double %a.3.d)
  %call1 = tail call float @hadd(float* nocapture noalias %a.0)
  %conv = fpext float %call1 to double
  %call2 = tail call i32 (i8*, ...)* @printf(i8* getelementptr inbounds ([4 x i8]* @.str1, i64 0, i64 0), double %conv)
  ret i32 0
}

declare i32 @printf(i8* nocapture readonly, ...)
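
(For reference, @hadd above is the IR for the C test case from the original
patch mail at the bottom of this thread:

float hadd(float* a) {
    return (a[0] + a[1]) + (a[2] + a[3]);
}

and @main stores FLT_MAX, -FLT_MAX, 0.001f, and 0.0f into a[0..3].)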

% ./bin/llc -o - hadd_bad.ll | ./bin/clang -o hadd_bad -x assembler -
% ./hadd_bad
a = 340282346638528859811704183484516925440.000000, b = -340282346638528859811704183484516925440.000000, c = 0.001000, d = 0.000000
0.001000

Now let's try running the SLP vectorizer:
% ./bin/opt -S -o - hadd_bad.ll -basicaa -slp-vectorizer -mcpu=corei7 | tee /dev/stderr | ./bin/llc | ./bin/clang -o hadd_bad -x assembler -
; ModuleID = 'hadd_bad.ll'
target datalayout = "e-m:e-i64:64-f80:128-n8:16:32:64-S128"
target triple = "x86_64-unknown-linux-gnu"

@.str = private unnamed_addr constant [32 x i8] c"a = %f, b = %f, c = %f, d = %f\0A\00", align 1
@.str1 = private unnamed_addr constant [4 x i8] c"%f\0A\00", align 1

define float @hadd(float* noalias nocapture %a.0) {
entry:
  %a.1 = getelementptr float* %a.0, i32 1
  %0 = bitcast float* %a.0 to <2 x float>*
  %1 = load <2 x float>* %0, align 4
  %a.2 = getelementptr float* %a.0, i32 2
  %a.3 = getelementptr float* %a.0, i32 3
  %2 = bitcast float* %a.2 to <2 x float>*
  %3 = load <2 x float>* %2, align 4
  %4 = fadd <2 x float> %1, %3
  %5 = extractelement <2 x float> %4, i32 0
  %6 = extractelement <2 x float> %4, i32 1
  %add2 = fadd float %5, %6
  ret float %add2
}

define i32 @main() {
<snip>
}

declare i32 @printf(i8* nocapture readonly, ...)

Note that it vectorized the hadd function by reassociating the additions.
Now if we run the program:

% ./hadd_bad
a = 340282346638528859811704183484516925440.000000, b = -340282346638528859811704183484516925440.000000, c = 0.001000, d = 0.000000
0.000000

Different result.
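
To spell out the reassociation: the two <2 x float> loads produce
{a[0], a[1]} and {a[2], a[3]}, the lanewise fadd yields
{a[0]+a[2], a[1]+a[3]}, and the final scalar fadd sums the two lanes. So the
vectorized @hadd computes (a[0] + a[2]) + (a[1] + a[3]) instead of
(a[0] + a[1]) + (a[2] + a[3]), and with these inputs the 0.001f is rounded
away in FLT_MAX + 0.001f before the exact cancellation can preserve it. A
standalone C sketch of the two groupings (my illustration, not part of the
original test):

#include <float.h>
#include <stdio.h>

int main(void) {
    /* The values @main stores: FLT_MAX, -FLT_MAX, 0.001f, 0.0f. */
    float a[4] = {FLT_MAX, -FLT_MAX, 0.001f, 0.0f};
    /* Scalar grouping: FLT_MAX and -FLT_MAX cancel exactly first. */
    printf("%f\n", (a[0] + a[1]) + (a[2] + a[3])); /* prints 0.001000 */
    /* Vectorized grouping: 0.001f is absorbed into FLT_MAX by rounding. */
    printf("%f\n", (a[0] + a[2]) + (a[1] + a[3])); /* prints 0.000000 */
    return 0;
}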

Please be very careful about this kind of transformation! These things need
careful review. We've spent several days tracking down subtle precision-loss
errors in our builds: because horizontal operations "rarely" fire, we had to
get *really* unlucky to observe this at all.

Please revert this patch and submit a new patch that includes appropriate
guards for correct behavior.
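
For what it's worth, here is a minimal sketch of the kind of guard I mean,
assuming Instruction::hasUnsafeAlgebra() is the right query for the
fast-math flags (illustrative only; the real check belongs wherever the
reduction is matched):

#include "llvm/IR/Instruction.h"
#include "llvm/IR/Type.h"

using namespace llvm;

// Regrouping an add chain is only safe when addition is associative:
// always for integers, for floating point only under unsafe-algebra.
static bool canReassociate(const Instruction *I) {
  if (!I->getType()->isFloatingPointTy())
    return true;
  return I->hasUnsafeAlgebra();
}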

On Fri, Dec 12, 2014 at 5:03 AM, Suyog Kamal Sarda <suyog.sarda at samsung.com>
wrote:
>
> Committed in r224119.
>
> Thanks a lot for the review.
>
> - Suyog
>
> ------- Original Message -------
> Sender : Suyog Kamal Sarda<suyog.sarda at samsung.com> Senior Software
> Engineer/SRI-Bangalore-TZN/Samsung Electronics
> Date : Dec 12, 2014 21:38 (GMT+09:00)
> Title : Re: Re: [Patch] [SLPVectorizer] Vectorize Horizontal Reductions
> from Consecutive Loads
>
> Hi Nadav,
>
> I ran LNT on x86 with 10 iterations and saw only one performance
> regression, in the test case MultiSource/Benchmarks/Prolangs-C++/fsm/fsm.
>
> However, this test case doesn't seem to be relevant to my vectorization
> patch (I checked the test case), so I am ignoring it and going ahead with
> the commit as you suggested.
>
> Attaching screenshots of the LNT results.
>
> Regards,
> Suyog
>
> ------- Original Message -------
> Sender : suyog sarda
> Date : Dec 12, 2014 04:10 (GMT+09:00)
> Title : Re: [Patch] [SLPVectorizer] Vectorize Horizontal Reductions from
> Consecutive Loads
>
> Hi Nadav,
>
>
> Thanks for reviewing the patch. I will upload the performance results by
> tomorrow.
>
> Just to be sure, you meant LNT test suite performance results, right?
>
>
> On Thu, Dec 11, 2014 at 10:32 PM, Nadav Rotem wrote:
>
> Hi Suyog,
>
> The change looks good to me. I think it would be a good idea to run the
> LLVM test suite and check whether there are any performance regressions.
>
> Thanks,
> Nadav
>
>
>
> > On Dec 11, 2014, at 4:38 AM, Suyog Kamal Sarda wrote:
> >
> > Hi All,
> >
> > This patch recognizes (+ (+ v0, v1) (+ v2, v3)), reorders the operands
> > for bundling into vectors of loads, and vectorizes it. Earlier, as
> > discussed on the LLVM mailing lists, we did not vectorize such
> > horizontal reductions.
> >
> > Test case:
> >
> >     float hadd(float* a) {
> >         return (a[0] + a[1]) + (a[2] + a[3]);
> >     }
> >
> >
> > AArch64 assembly before patch:
> >
> >     ldp     s0, s1, [x0]
> >     ldp     s2, s3, [x0, #8]
> >     fadd    s0, s0, s1
> >     fadd    s1, s2, s3
> >     fadd    s0, s0, s1
> >     ret
> >
> > AArch64 assembly after patch:
> >
> >     ldp     d0, d1, [x0]
> >     fadd    v0.2s, v0.2s, v1.2s
> >     faddp   s0, v0.2s
> >     ret
> >
> > More work remains to recognize (+ (+ (+ v0, v1) v2) v3); I will address
> > that in a follow-up patch.
> >
> > Please help review the patch. No 'make check' failures were observed
> > with it.
> >
> > (I would have preferred Phabricator, but it's not working, so I am
> > sending this via e-mail.)
> >
> > Regards,
> > Suyog
>
>
> --
>
> With regards,
> Suyog Sarda
>
> _______________________________________________
> llvm-commits mailing list
> llvm-commits at cs.uiuc.edu
> http://lists.cs.uiuc.edu/mailman/listinfo/llvm-commits
>