<div dir="ltr">This patch is pretty deeply flawed -- it triggers reassociation of floating point arithmetic in non-fastmath code.<div><br></div><div>Consider this example which nicely demonstrates why reassociation is problematic:</div><div><br></div><div><div>% cat hadd_bad.ll</div><div>; ModuleID = 'hadd_bad.c'</div><div>target datalayout = "e-m:e-i64:64-f80:128-n8:16:32:64-S128"</div><div>target triple = "x86_64-unknown-linux-gnu"</div><div><br></div><div>@.str = private unnamed_addr constant [32 x i8] c"a = %f, b = %f, c = %f, d = %f\0A\00", align 1</div><div>@.str1 = private unnamed_addr constant [4 x i8] c"%f\0A\00", align 1</div><div><br></div><div>define float @hadd(float* nocapture noalias %a.0) {</div><div>entry:</div><div>  %a.0.v = load float* %a.0</div><div>  %a.1 = getelementptr float* %a.0, i32 1</div><div>  %a.1.v = load float* %a.1</div><div>  %add = fadd float %a.0.v, %a.1.v</div><div>  %a.2 = getelementptr float* %a.0, i32 2</div><div>  %a.2.v = load float* %a.2</div><div>  %a.3 = getelementptr float* %a.0, i32 3</div><div>  %a.3.v = load float* %a.3</div><div>  %add1 = fadd float %a.2.v, %a.3.v</div><div>  %add2 = fadd float %add, %add1</div><div>  ret float %add2</div><div>}</div><div><br></div><div>define i32 @main() {</div><div>entry:</div><div>  %a = alloca [4 x float]</div><div>  %a.0 = getelementptr [4 x float]* %a, i32 0, i32 0</div><div>  %a.1 = getelementptr [4 x float]* %a, i32 0, i32 1</div><div>  %a.2 = getelementptr [4 x float]* %a, i32 0, i32 2</div><div>  %a.3 = getelementptr [4 x float]* %a, i32 0, i32 3</div><div>  store float 0x47EFFFFFE0000000, float* %a.0</div><div>  store float 0xC7EFFFFFE0000000, float* %a.1</div><div>  store float 0x3F50624DE0000000, float* %a.2</div><div>  store float 0.000000e+00, float* %a.3</div><div>  %a.0.v = load float* %a.0</div><div>  %a.1.v = load float* %a.1</div><div>  %a.2.v = load float* %a.2</div><div>  %a.3.v = load float* %a.3</div><div>  %a.0.d = fpext float %a.0.v to double</div><div>  %a.1.d = fpext float %a.1.v to double</div><div>  %a.2.d = fpext float %a.2.v to double</div><div>  %a.3.d = fpext float %a.3.v to double</div><div>  %call = tail call i32 (i8*, ...)* @printf(i8* getelementptr inbounds ([32 x i8]* @.str, i64 0, i64 0), double %a.0.d, double %a.1.d, double %a.2.d, double %a.3.d)</div><div>  %call1 = tail call float @hadd(float* nocapture noalias %a.0)</div><div>  %conv = fpext float %call1 to double</div><div>  %call2 = tail call i32 (i8*, ...)* @printf(i8* getelementptr inbounds ([4 x i8]* @.str1, i64 0, i64 0), double %conv)</div><div>  ret i32 0</div><div>}</div><div><br></div><div>declare i32 @printf(i8* nocapture readonly, ...)</div></div><div><br></div><div>% ./bin/llc -o - hadd_bad.ll | ./bin/clang -o hadd_bad -x assembler -<br></div><div><div>% ./hadd_bad</div><div>a = 340282346638528859811704183484516925440.000000, b = -340282346638528859811704183484516925440.000000, c = 0.001000, d = 0.000000</div><div>0.001000</div></div><div><br></div><div>Now let's try running the SLP vectorizer:</div><div><div>% ./bin/opt -S -o - hadd_bad.ll -basicaa -slp-vectorizer -mcpu=corei7 | tee /dev/stderr | ./bin/llc | ./bin/clang -o hadd_bad -x assembler -</div><div>; ModuleID = 'hadd_bad.ll'</div><div>target datalayout = "e-m:e-i64:64-f80:128-n8:16:32:64-S128"</div><div>target triple = "x86_64-unknown-linux-gnu"</div><div><br></div><div>@.str = private unnamed_addr constant [32 x i8] c"a = %f, b = %f, c = %f, d = %f\0A\00", align 1</div><div>@.str1 = private unnamed_addr constant [4 x i8] c"%f\0A\00", align 1</div><div><br></div><div>define float @hadd(float* noalias nocapture %a.0) {</div><div>entry:</div><div>  %a.1 = getelementptr float* %a.0, i32 1</div><div>  %0 = bitcast float* %a.0 to <2 x float>*</div><div>  %1 = load <2 x float>* %0, align 4</div><div>  %a.2 = getelementptr float* %a.0, i32 2</div><div>  %a.3 = getelementptr float* %a.0, i32 3</div><div>  %2 = bitcast float* %a.2 to <2 x float>*</div><div>  %3 = load <2 x float>* %2, align 4</div><div>  %4 = fadd <2 x float> %1, %3</div><div>  %5 = extractelement <2 x float> %4, i32 0</div><div>  %6 = extractelement <2 x float> %4, i32 1</div><div>  %add2 = fadd float %5, %6</div><div>  ret float %add2</div><div>}</div><div><br></div><div>define i32 @main() {</div><div><snip></div><div>}</div><div><br></div><div>declare i32 @printf(i8* nocapture readonly, ...)</div></div><div><br></div><div>Note that it vectorized the hadd function by reassociating the additions. Now if we run the program:</div><div><br></div><div><div>% ./hadd_bad</div><div>a = 340282346638528859811704183484516925440.000000, b = -340282346638528859811704183484516925440.000000, c = 0.001000, d = 0.000000</div><div>0.000000</div></div><div><br></div><div>Different result.</div><div><br></div><div>Please be very careful about this kind of transformation! These things need careful review. We've spent several days tracking down subtle precision loss errors in our builds because horizontal operations "rarely" fire, and so we had to get *really* unlucky to observe this.</div><div><br></div><div>Please revert this patch and submit a new patch that includes appropriate guards for correct behavior.</div></div><div class="gmail_extra"><br><div class="gmail_quote">On Fri, Dec 12, 2014 at 5:03 AM, Suyog Kamal Sarda <span dir="ltr"><<a href="mailto:suyog.sarda@samsung.com" target="_blank">suyog.sarda@samsung.com</a>></span> wrote:<blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">Committed in r224119.<br>

<br>

Thanks a lot for the review.<br>

<br>

- Suyog<br>

<br>

------- Original Message -------<br>

Sender : Suyog Kamal Sarda<<a href="mailto:suyog.sarda@samsung.com">suyog.sarda@samsung.com</a>> Senior Software Engineer/SRI-Bangalore-TZN/Samsung Electronics<br>

Date : Dec 12, 2014 21:38 (GMT+09:00)<br>

Title : Re: Re: [Patch] [SLPVectorizer] Vectorize Horizontal Reductions from Consecutive Loads<br>

<br>

Hi Nadav,<br>

<span class="im HOEnZb"><br>

I ran LNT on x86 with 10 iterations and saw only one regression in performance in<br>

test case : MultiSource/Benchmarks/Prolangs-C++/fsm/fsm<br>

<br>

However, this test case doesn't seem to be relevant to my vectorization patch<br>

(I checked the TC) and hence I am ignoring it and going ahead with the commit<br>

as suggested by you.<br>

<br>

Attaching Screenshots of LNT results.<br>

<br>

Regards,<br>

Suyog<br>

<br>

------- Original Message -------<br>

Sender : suyog sarda<br>

</span><span class="im HOEnZb">Date : Dec 12, 2014 04:10 (GMT+09:00)<br>

Title : Re: [Patch] [SLPVectorizer] Vectorize Horizontal Reductions from Consecutive Loads<br>

<br>

Hi Nadav,<br>

<br>

<br>

Thanks for reviewing the patch. I will upload the performance results by tomorrow.<br>

<br>

Just to be sure, you meant LNT test suite performance results, right?<br>

<br>

<br>

</span><span class="im HOEnZb">On Thu, Dec 11, 2014 at 10:32 PM, Nadav Rotem wrote:<br>

<br>

Hi Suyog,<br>

<br>

The change looks good to me.  I think that it would be a good idea to run the LLVM test suite and check if there there are any performance regressions.<br>

<br>

Thanks,<br>

Nadav<br>

<br>

<br>

<br>

</span><span class="im HOEnZb">> On Dec 11, 2014, at 4:38 AM, Suyog Kamal Sarda wrote:<br>

><br>

> Hi All,<br>

><br>

> This patch recognizes (+ (+ v0, v1) (+ v2, v3)), reorders them for bundling into vector of loads,<br>

> and vectorizes it. Earlier as discussed in LLVM mail threads, we didn't vectorize such horizontal reductions.<br>

><br>

> Test case :<br>

><br>

>       float hadd(float* a) {<br>

>           return (a[0] + a[1]) + (a[2] + a[3]);<br>

>        }<br>

><br>

><br>

> AArch64 assembly before patch :<br>

><br>

>                 ldp   s0, s1, [x0]<br>

>       ldp     s2, s3, [x0, #8]<br>

>       fadd    s0, s0, s1<br>

>       fadd    s1, s2, s3<br>

>       fadd    s0, s0, s1<br>

>       ret<br>

><br>

> AArch64 assembly after patch :<br>

><br>

>                 ldp   d0, d1, [x0]<br>

>       fadd    v0.2s, v0.2s, v1.2s<br>

>       faddp   s0, v0.2s<br>

>       ret<br>

><br>

> More work of recognizing (+(+(+ v0, v1) v2) v3) still remains. I will come up with this in another patch.<br>

><br>

> Please help in reviewing the patch. No 'make-check' failures observed with this patch.<br>

><br>

> (Would have preferred Phabricator, but its not working and hence sending via e-mail)<br>

><br>

> Regards,<br>

<br>

> Suyog<br>

<br>

<br>

</span><div class="HOEnZb"><div class="h5">_______________________________________________<br>

llvm-commits mailing list<br>

<a href="mailto:llvm-commits@cs.uiuc.edu">llvm-commits@cs.uiuc.edu</a><br>

<a href="http://lists.cs.uiuc.edu/mailman/listinfo/llvm-commits" target="_blank">http://lists.cs.uiuc.edu/mailman/listinfo/llvm-commits</a><br>

<br>

<br>

<br>

<br>

<br>

<br>

--<br>

<br>

With regards,<br>

Suyog Sarda<br>

<br>

_______________________________________________<br>

llvm-commits mailing list<br>

<a href="mailto:llvm-commits@cs.uiuc.edu">llvm-commits@cs.uiuc.edu</a><br>

<a href="http://lists.cs.uiuc.edu/mailman/listinfo/llvm-commits" target="_blank">http://lists.cs.uiuc.edu/mailman/listinfo/llvm-commits</a><br>

</div></div></blockquote></div></div>