[Patch] [SLPVectorizer] Vectorize Horizontal Reductions from Consecutive Loads

Suyog Kamal Sarda suyog.sarda at samsung.com
Wed Dec 17 01:14:51 PST 2014


Hi Chandler,

Thanks for pointing out the error. I agree with your analysis and will revert the revision shortly.

I have identified this and have tried to address the issue - by only doing the
transformation for 'integer' types - in http://reviews.llvm.org/D6675. (Some minor
changes to the checks are still required there as well.)
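
For contrast, an integer counterpart of Chandler's example is always safe to
reassociate, since i32 addition is associative and commutative even when it wraps,
so no regrouping can change the result. (This is a hand-written sketch; @hadd_int
is not from either patch.)

define i32 @hadd_int(i32* nocapture noalias %a.0) {
entry:
  %a.0.v = load i32* %a.0
  %a.1 = getelementptr i32* %a.0, i32 1
  %a.1.v = load i32* %a.1
  ; a[0] + a[1]
  %add = add i32 %a.0.v, %a.1.v
  %a.2 = getelementptr i32* %a.0, i32 2
  %a.2.v = load i32* %a.2
  %a.3 = getelementptr i32* %a.0, i32 3
  %a.3.v = load i32* %a.3
  ; a[2] + a[3], then the final sum; any regrouping yields the same i32 value
  %add1 = add i32 %a.2.v, %a.3.v
  %add2 = add i32 %add, %add1
  ret i32 %add2
}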

Do you agree with the approach in http://reviews.llvm.org/D6675?

Regards,
Suyog


------- Original Message -------
Sender : Chandler Carruth <chandlerc at google.com>
Date : Dec 17, 2014 17:58 (GMT+09:00)
Title : Re: [Patch] [SLPVectorizer] Vectorize Horizontal Reductions from Consecutive Loads

This patch is pretty deeply flawed -- it triggers reassociation of floating-point arithmetic in non-fast-math code.


Consider this example, which nicely demonstrates why reassociation is problematic:


% cat hadd_bad.ll
; ModuleID = 'hadd_bad.c'
target datalayout = "e-m:e-i64:64-f80:128-n8:16:32:64-S128"
target triple = "x86_64-unknown-linux-gnu"


@.str = private unnamed_addr constant [32 x i8] c"a = %f, b = %f, c = %f, d = %f\0A\00", align 1
@.str1 = private unnamed_addr constant [4 x i8] c"%f\0A\00", align 1


define float @hadd(float* nocapture noalias %a.0) {
entry:
  %a.0.v = load float* %a.0
  %a.1 = getelementptr float* %a.0, i32 1
  %a.1.v = load float* %a.1
  %add = fadd float %a.0.v, %a.1.v
  %a.2 = getelementptr float* %a.0, i32 2
  %a.2.v = load float* %a.2
  %a.3 = getelementptr float* %a.0, i32 3
  %a.3.v = load float* %a.3
  %add1 = fadd float %a.2.v, %a.3.v
  %add2 = fadd float %add, %add1
  ret float %add2
}


define i32 @main() {
entry:
  %a = alloca [4 x float]
  %a.0 = getelementptr [4 x float]* %a, i32 0, i32 0
  %a.1 = getelementptr [4 x float]* %a, i32 0, i32 1
  %a.2 = getelementptr [4 x float]* %a, i32 0, i32 2
  %a.3 = getelementptr [4 x float]* %a, i32 0, i32 3
  store float 0x47EFFFFFE0000000, float* %a.0
  store float 0xC7EFFFFFE0000000, float* %a.1
  store float 0x3F50624DE0000000, float* %a.2
  store float 0.000000e+00, float* %a.3
  %a.0.v = load float* %a.0
  %a.1.v = load float* %a.1
  %a.2.v = load float* %a.2
  %a.3.v = load float* %a.3
  %a.0.d = fpext float %a.0.v to double
  %a.1.d = fpext float %a.1.v to double
  %a.2.d = fpext float %a.2.v to double
  %a.3.d = fpext float %a.3.v to double
  %call = tail call i32 (i8*, ...)* @printf(i8* getelementptr inbounds ([32 x i8]* @.str, i64 0, i64 0), double %a.0.d, double %a.1.d, double %a.2.d, double %a.3.d)
  %call1 = tail call float @hadd(float* nocapture noalias %a.0)
  %conv = fpext float %call1 to double
  %call2 = tail call i32 (i8*, ...)* @printf(i8* getelementptr inbounds ([4 x i8]* @.str1, i64 0, i64 0), double %conv)
  ret i32 0
}


declare i32 @printf(i8* nocapture readonly, ...)


% ./bin/llc -o - hadd_bad.ll | ./bin/clang -o hadd_bad -x assembler -

% ./hadd_bad
a = 340282346638528859811704183484516925440.000000, b = -340282346638528859811704183484516925440.000000, c = 0.001000, d = 0.000000
0.001000


Now let's try running the SLP vectorizer:
% ./bin/opt -S -o - hadd_bad.ll -basicaa -slp-vectorizer -mcpu=corei7 | tee /dev/stderr | ./bin/llc | ./bin/clang -o hadd_bad -x assembler -
; ModuleID = 'hadd_bad.ll'
target datalayout = "e-m:e-i64:64-f80:128-n8:16:32:64-S128"
target triple = "x86_64-unknown-linux-gnu"


@.str = private unnamed_addr constant [32 x i8] c"a = %f, b = %f, c = %f, d = %f\0A\00", align 1
@.str1 = private unnamed_addr constant [4 x i8] c"%f\0A\00", align 1


define float @hadd(float* noalias nocapture %a.0) {
entry:
  %a.1 = getelementptr float* %a.0, i32 1
  %0 = bitcast float* %a.0 to <2 x float>*
  %1 = load <2 x float>* %0, align 4
  %a.2 = getelementptr float* %a.0, i32 2
  %a.3 = getelementptr float* %a.0, i32 3
  %2 = bitcast float* %a.2 to <2 x float>*
  %3 = load <2 x float>* %2, align 4
  %4 = fadd <2 x float> %1, %3
  %5 = extractelement <2 x float> %4, i32 0
  %6 = extractelement <2 x float> %4, i32 1
  %add2 = fadd float %5, %6
  ret float %add2
}


define i32 @main() {
<snip>
}


declare i32 @printf(i8* nocapture readonly, ...)


Note that it vectorized the hadd function by reassociating the additions: the vector fadd pairs the lanes as a[0] + a[2] and a[1] + a[3], so the scalar result becomes (a[0] + a[2]) + (a[1] + a[3]) instead of the original (a[0] + a[1]) + (a[2] + a[3]). Now if we run the program:


% ./hadd_bad
a = 340282346638528859811704183484516925440.000000, b = -340282346638528859811704183484516925440.000000, c = 0.001000, d = 0.000000
0.000000


A different result. Here a[0] is FLT_MAX and a[1] is -FLT_MAX, so the original grouping cancels them first and returns 0.001, while the reassociated grouping computes FLT_MAX + 0.001 (which rounds back to FLT_MAX) before the cancellation, and returns 0.0.


Please be very careful about this kind of transformation! These things need thorough review. We've spent several days tracking down subtle precision-loss errors in our builds: horizontal operations "rarely" fire, so we had to get *really* unlucky to observe this.


Please revert this patch and submit a new patch that includes appropriate guards for correct behavior.
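
For the record, the transform would be fine on IR that explicitly opts into
reassociation via fast-math flags. A hand-written sketch of what opted-in input
would look like (illustrative only, not part of either patch):

define float @hadd_fast(float* nocapture noalias %a.0) {
entry:
  %a.0.v = load float* %a.0
  %a.1 = getelementptr float* %a.0, i32 1
  %a.1.v = load float* %a.1
  ; the 'fast' flag declares that reassociating these adds is acceptable
  %add = fadd fast float %a.0.v, %a.1.v
  %a.2 = getelementptr float* %a.0, i32 2
  %a.2.v = load float* %a.2
  %a.3 = getelementptr float* %a.0, i32 3
  %a.3.v = load float* %a.3
  %add1 = fadd fast float %a.2.v, %a.3.v
  %add2 = fadd fast float %add, %add1
  ret float %add2
}

Restricting the transform to integer types, where reassociation is always
value-preserving, would be the other obvious guard.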


On Fri, Dec 12, 2014 at 5:03 AM, Suyog Kamal Sarda <suyog.sarda at samsung.com> wrote:
Committed in r224119.

Thanks a lot for the review.

- Suyog

------- Original Message -------
Sender : Suyog Kamal Sarda <suyog.sarda at samsung.com>, Senior Software Engineer/SRI-Bangalore-TZN/Samsung Electronics
Date : Dec 12, 2014 21:38 (GMT+09:00)
Title : Re: Re: [Patch] [SLPVectorizer] Vectorize Horizontal Reductions from Consecutive Loads

Hi Nadav,

I ran LNT on x86 with 10 iterations and saw only one performance regression, in the
test case MultiSource/Benchmarks/Prolangs-C++/fsm/fsm.

However, this test case does not appear to be affected by my vectorization patch
(I inspected the test case), so I am ignoring the regression and going ahead with
the commit, as you suggested.

Attaching Screenshots of LNT results.

Regards,
Suyog

------- Original Message -------
Sender : suyog sarda
Date : Dec 12, 2014 04:10 (GMT+09:00)
Title : Re: [Patch] [SLPVectorizer] Vectorize Horizontal Reductions from Consecutive Loads

Hi Nadav,


Thanks for reviewing the patch. I will upload the performance results by tomorrow.

Just to be sure, you meant LNT test suite performance results, right?


On Thu, Dec 11, 2014 at 10:32 PM, Nadav Rotem wrote:

Hi Suyog,

The change looks good to me.  I think it would be a good idea to run the LLVM test suite and check whether there are any performance regressions.

Thanks,
Nadav



> On Dec 11, 2014, at 4:38 AM, Suyog Kamal Sarda wrote:
>
> Hi All,
>
> This patch recognizes reductions of the form (+ (+ v0, v1) (+ v2, v3)), reorders the operands
> so that the loads can be bundled into a vector load, and vectorizes the whole expression.
> As discussed earlier on the LLVM mailing lists, we previously did not vectorize such horizontal reductions.
>
> Test case :
>
>       float hadd(float* a) {
>           return (a[0] + a[1]) + (a[2] + a[3]);
>        }
>
>
> AArch64 assembly before patch:
>
>       ldp     s0, s1, [x0]
>       ldp     s2, s3, [x0, #8]
>       fadd    s0, s0, s1
>       fadd    s1, s2, s3
>       fadd    s0, s0, s1
>       ret
>
> AArch64 assembly after patch:
>
>       ldp     d0, d1, [x0]
>       fadd    v0.2s, v0.2s, v1.2s
>       faddp   s0, v0.2s
>       ret
>
> Recognizing the left-associated form (+ (+ (+ v0, v1) v2) v3) still remains; I will address it in a follow-up patch (see the sketch below).
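>
> For reference, that left-associated chain corresponds to IR of roughly this shape
> (a hand-written sketch, not output from this patch; the function name is made up):
>
>       define float @chain_hadd(float %v0, float %v1, float %v2, float %v3) {
>         %add  = fadd float %v0, %v1
>         %add1 = fadd float %add, %v2
>         %add2 = fadd float %add1, %v3
>         ret float %add2
>       }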
>
> Please help review the patch. No 'make check' failures were observed with it.
>
> (I would have preferred Phabricator, but it's not working, hence sending via e-mail.)
>
> Regards,
> Suyog









--

With regards,
Suyog Sarda

_______________________________________________
llvm-commits mailing list
llvm-commits at cs.uiuc.edu
http://lists.cs.uiuc.edu/mailman/listinfo/llvm-commits



