<div dir="ltr"><br><div class="gmail_extra"><br><div class="gmail_quote">On Tue, Jul 7, 2015 at 1:30 PM, Sanjay Patel <span dir="ltr"><<a href="mailto:spatel@rotateright.com" target="_blank">spatel@rotateright.com</a>></span> wrote:<br><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"><div dir="ltr"><div>I forgot to update the SLP thread from last week. I have a patch up for review that would allow creating wider vectors as requested, but may increase SLP compile time:<br><a href="https://urldefense.proofpoint.com/v2/url?u=http-3A__reviews.llvm.org_D10950&d=AwMFaQ&c=8hUWFZcy2Z-Za5rBPlktOQ&r=Mfk2qtn1LTDThVkh6-oGglNfMADXfJdty4_bhmuhMHA&m=8jtQ7zZm8iQAJpQaNpkr7Y4M4qjywzbPnzQMMGj8yYI&s=WY2L1p3xDmp9RxCyN74r4HI0-RIkxnXgIBDUeBNnJzc&e=" target="_blank">http://reviews.llvm.org/D10950</a><br><br>An audit of the trunk backends shows that only PowerPC + QPX and x86 + 

AVX / AVX512 would potentially get an extra round of store merging from the use of 

'getRegisterBitWidth()'. <br><br>As reported in the phab comments, I didn't see any compile time hit on test-suite for an AVX machine. I'm very curious to know if that patch causes further blowup in this example.<br></div><div><br></div>Frank, what causes a 10^6 instruction function to be generated? Can this be rolled into a loop?<br></div></blockquote><div><br></div><div>Yeah, when I've run into these "huge BB of dense computation" in the past, it is usually something that would be smaller and faster to implement as a loop with a table. Better to conserve DRAM bandwidth (K inst/cycle * N GHz adds up); you're effectively using the instruction stream as a table, and a not-super-dense one at that. Also it is easier to verify/tune the scheduling/vectorization of a small loop kernel.</div><div><br></div><div>-- Sean Silva</div><div> </div><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"><div dir="ltr"><br></div><div class="HOEnZb"><div class="h5"><div class="gmail_extra"><br><div class="gmail_quote">On Tue, Jul 7, 2015 at 12:55 PM, Michael Zolotukhin <span dir="ltr"><<a href="mailto:mzolotukhin@apple.com" target="_blank">mzolotukhin@apple.com</a>></span> wrote:<br><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">Hi Frank,<br>

<br>

The most time-consuming part of the SLP vectorizer (especially in cases like yours) is finding sets of consecutive stores. This is currently done by a quadratic search (see routine SLPVectorizer::vectorizeStores): we do pairwise comparisons between all pointers, although we limit ourselves to looking at no more than 16 stores. I think it should be possible to group pointers with a common base, compute their constant relative offsets, and just sort them; this way we’ll save a lot of expensive computation. However, I haven’t tried implementing this, and I guess there might be some hard corner cases too. Patches would be welcome here :)<br>
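<br>
A minimal sketch of the sort-based grouping idea above, written in Python rather than as a patch to SLPVectorizer. It assumes every store has already been reduced to a (base pointer, constant element offset) pair. The name consecutive_store_runs and the min_run parameter are hypothetical, and duplicate stores, aliasing and instruction order are ignored here:<br>

```python
from collections import defaultdict


def consecutive_store_runs(stores, min_run=2):
    """Group stores by base pointer, sort by offset, and collect runs.

    `stores` is an iterable of (base, offset) pairs, with offsets counted
    in elements. This is an illustrative stand-in, not LLVM's data model.
    Returns a list of (base, [offsets]) runs of consecutive offsets.
    """
    by_base = defaultdict(set)  # a set drops duplicate offsets (simplification)
    for base, offset in stores:
        by_base[base].add(offset)

    runs = []
    for base, offsets in by_base.items():
        ordered = sorted(offsets)  # O(n log n) instead of pairwise comparison
        run = [ordered[0]]
        for off in ordered[1:]:
            if off == run[-1] + 1:
                run.append(off)  # still consecutive, extend the current run
            else:
                if len(run) >= min_run:
                    runs.append((base, run))
                run = [off]  # gap found, start a new run
        if len(run) >= min_run:
            runs.append((base, run))
    return runs


# Offsets loosely modelled on the stores to %arg0 in the example IR.
example = [("arg0", 4096), ("arg0", 0), ("arg0", 1),
           ("arg0", 4097), ("arg0", 2), ("arg1", 7)]
print(consecutive_store_runs(example))
# prints [('arg0', [0, 1, 2]), ('arg0', [4096, 4097])]
```

Sorting within each base makes the cost O(n log n) per group rather than quadratic in the number of stores. The hard corner cases mentioned above (pointers without a common base or with non-constant offsets) are exactly what this sketch leaves out.<br>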

<br>

Thanks,<br>

Michael<br>

<div><div><br>

> On Jul 7, 2015, at 11:31 AM, Frank Winter <<a href="mailto:fwinter@jlab.org" target="_blank">fwinter@jlab.org</a>> wrote:<br>

><br>

> Hi all!<br>

><br>

> It takes the current SLP vectorizer too long to vectorize my scalar code. I am talking here about functions that have a single, huge basic block with O(10^6) instructions. Here's an example:<br>

><br>

>  %0 = getelementptr float* %arg1, i32 49<br>

>  %1 = load float* %0<br>

>  %2 = getelementptr float* %arg1, i32 4145<br>

>  %3 = load float* %2<br>

>  %4 = getelementptr float* %arg2, i32 49<br>

>  %5 = load float* %4<br>

>  %6 = getelementptr float* %arg2, i32 4145<br>

>  %7 = load float* %6<br>

>  %8 = fmul float %7, %1<br>

>  %9 = fmul float %5, %3<br>

>  %10 = fadd float %9, %8<br>

>  %11 = fmul float %7, %3<br>

>  %12 = fmul float %5, %1<br>

>  %13 = fsub float %12, %11<br>

>  %14 = getelementptr float* %arg3, i32 16<br>

>  %15 = load float* %14<br>

>  %16 = getelementptr float* %arg3, i32 4112<br>

>  %17 = load float* %16<br>

>  %18 = getelementptr float* %arg4, i32 0<br>

>  %19 = load float* %18<br>

>  %20 = getelementptr float* %arg4, i32 4096<br>

>  %21 = load float* %20<br>

>  %22 = fmul float %21, %15<br>

>  %23 = fmul float %19, %17<br>

>  %24 = fadd float %23, %22<br>

>  %25 = fmul float %21, %17<br>

>  %26 = fmul float %19, %15<br>

>  %27 = fsub float %26, %25<br>

>  %28 = fadd float %24, %10<br>

>  %29 = fadd float %27, %13<br>

>  %30 = getelementptr float* %arg0, i32 0<br>

>  store float %29, float* %30<br>

>  %31 = getelementptr float* %arg0, i32 4096<br>

>  store float %28, float* %31<br>

> ... and so on ...<br>

><br>

> The SLP vectorizer would create some code like this:<br>

><br>

>  %219 = insertelement <4 x float> %218, float %185, i32 2<br>

>  %220 = insertelement <4 x float> %219, float %197, i32 3<br>

>  %221 = fmul <4 x float> %216, %220<br>

>  %222 = fadd <4 x float> %221, %212<br>

>  %223 = fmul <4 x float> %207, %220<br>

> ..<br>

>  %234 = bitcast float* %165 to <4 x float>*<br>

>  store <4 x float> %233, <4 x float>* %234, align 4<br>

><br>

><br>

> With the current SLP implementation 99.5% of the time is spent in the SLP vectorizer, and I have the impression that this can be improved for my case. I believe that the SLP vectorizer has far more capabilities than I need for these simple (but huge) functions. I was hoping one of you might have an idea how to strip functionality out of the SLP vectorizer so that it can still vectorize these simple functions?<br>

><br>

> Thanks,<br>

> Frank<br>

><br>

> _______________________________________________<br>

> LLVM Developers mailing list<br>

> <a href="mailto:LLVMdev@cs.uiuc.edu" target="_blank">LLVMdev@cs.uiuc.edu</a>         <a href="http://llvm.cs.uiuc.edu" rel="noreferrer" target="_blank">http://llvm.cs.uiuc.edu</a><br>

> <a href="http://lists.cs.uiuc.edu/mailman/listinfo/llvmdev" rel="noreferrer" target="_blank">http://lists.cs.uiuc.edu/mailman/listinfo/llvmdev</a><br>

<br>

<br>


</div></div></blockquote></div><br></div>

</div></div><br>

<br></blockquote></div><br></div></div>