[LLVMdev] Modifications to SLP

Sean Silva chisophugis at gmail.com
Tue Jul 7 20:32:12 PDT 2015


On Tue, Jul 7, 2015 at 1:30 PM, Sanjay Patel <spatel at rotateright.com> wrote:

> I forgot to update the SLP thread from last week. I have a patch up for
> review that would allow creating wider vectors as requested, but may
> increase SLP compile time:
> http://reviews.llvm.org/D10950
>
> An audit of the trunk backends shows that only PowerPC + QPX and x86 + AVX
> / AVX512 would potentially get an extra round of store merging from the use
> of 'getRegisterBitWidth()'.
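
The idea, roughly, is to ask the target how wide its vector registers are
(via getRegisterBitWidth()) and size the store groups from that instead of
assuming 128 bits. Here is a minimal, self-contained sketch of that
computation; the names below are made up for illustration and this is not
the actual D10950 code:

  // Sketch only: derive how many lanes a group of scalars could occupy
  // from the target's vector register width instead of a hard-coded
  // 128 bits. computeMaxVecFactor and vectorRegisterBits are hypothetical.
  #include <cstdio>

  unsigned computeMaxVecFactor(unsigned vectorRegisterBits,
                               unsigned elementSizeBits) {
    if (elementSizeBits == 0 || vectorRegisterBits < elementSizeBits)
      return 1; // nothing wider than a scalar fits
    return vectorRegisterBits / elementSizeBits;
  }

  int main() {
    std::printf("AVX (256-bit), float:     %u lanes\n",
                computeMaxVecFactor(256, 32));
    std::printf("AVX-512 (512-bit), float: %u lanes\n",
                computeMaxVecFactor(512, 32));
    std::printf("QPX (256-bit), double:    %u lanes\n",
                computeMaxVecFactor(256, 64));
    return 0;
  }
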
>
> As reported in the Phabricator comments, I didn't see any compile-time hit
> on the test-suite for an AVX machine. I'm very curious to know whether that
> patch causes further blowup in this example.
>
> Frank, what causes a 10^6-instruction function to be generated? Can this
> be rolled into a loop?
>

Yeah, when I've run into these "huge BB of dense computation" cases in the
past, it's usually something that would be smaller and faster to implement
as a loop with a table. Better to conserve DRAM bandwidth (K inst/cycle * N
GHz adds up); you're effectively using the instruction stream as a table,
and a not-very-dense one at that. It's also easier to verify and tune the
scheduling/vectorization of a small loop kernel.
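
To make that concrete, here is a minimal sketch of the loop-with-a-table
shape I mean, assuming (judging from the IR quoted further down) that the
big block repeats the same split-complex multiply-add for many index
tuples. The struct layout, the 4096 stride, and the index names are guesses
for illustration, not Frank's actual code:

  // Sketch only: the same work expressed as a small kernel driven by a
  // table of element indices, instead of ~10^6 straight-line instructions.
  // It mirrors the visible IR pattern: a split-complex multiply-add with
  // each imaginary part stored 4096 elements after its real part.
  #include <cstddef>

  struct IndexEntry {
    int i1, i2, i3, i4, out; // per-array element offsets for one repetition
  };

  void kernel(float *arg0, const float *arg1, const float *arg2,
              const float *arg3, const float *arg4,
              const IndexEntry *table, std::size_t n) {
    const int kImag = 4096; // real part at x, imaginary part at x + 4096
    for (std::size_t i = 0; i < n; ++i) {
      const IndexEntry &e = table[i];
      float a_re = arg1[e.i1], a_im = arg1[e.i1 + kImag];
      float b_re = arg2[e.i2], b_im = arg2[e.i2 + kImag];
      float c_re = arg3[e.i3], c_im = arg3[e.i3 + kImag];
      float d_re = arg4[e.i4], d_im = arg4[e.i4 + kImag];
      // a*b + c*d on split-format complex data, as in the scalar IR.
      float re = (b_re * a_re - b_im * a_im) + (d_re * c_re - d_im * c_im);
      float im = (b_re * a_im + b_im * a_re) + (d_re * c_im + d_im * c_re);
      arg0[e.out]         = re;
      arg0[e.out + kImag] = im;
    }
  }

The index table then lives in data, which is dense and prefetchable, rather
than in the instruction stream, and the kernel is small enough to inspect
and tune by hand.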

-- Sean Silva


>
>
> On Tue, Jul 7, 2015 at 12:55 PM, Michael Zolotukhin <mzolotukhin at apple.com
> > wrote:
>
>> Hi Frank,
>>
>> The most time-consuming part of the SLP vectorizer (especially in cases
>> like yours) is finding sets of consecutive stores. This is currently done
>> with a quadratic search (see SLPVectorizer::vectorizeStores): we do
>> pairwise comparisons between all pointers (though we limit ourselves to
>> looking at at most 16 stores). I think it should be possible to group
>> pointers with a common base, compute their constant relative offsets, and
>> just sort them - this way we'll save a lot of expensive computations.
>> However, I haven't tried implementing this, and I guess there might be
>> some hard corner cases too. Patches would be welcome here :)
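
A rough, self-contained sketch of what that linearithmic scheme could look
like: bucket the candidate stores by base pointer, sort each bucket by its
constant offset, and read consecutive runs off the sorted list. StoreDesc
is a stand-in for (store, base, offset) and not the actual SLPVectorizer
data structure:

  #include <algorithm>
  #include <cstddef>
  #include <cstdint>
  #include <map>
  #include <vector>

  struct StoreDesc {
    unsigned id;         // position of the store in the basic block
    const void *base;    // base pointer the address is computed from
    std::int64_t offset; // constant element offset from that base
  };

  // Returns maximal runs of stores whose offsets step by exactly elemSize
  // (the chains vectorizeStores is after), found in O(n log n) rather than
  // by comparing all pairs of pointers.
  std::vector<std::vector<StoreDesc>>
  findConsecutiveRuns(const std::vector<StoreDesc> &stores,
                      std::int64_t elemSize) {
    std::map<const void *, std::vector<StoreDesc>> byBase;
    for (const StoreDesc &s : stores)
      byBase[s.base].push_back(s);

    std::vector<std::vector<StoreDesc>> runs;
    for (auto &entry : byBase) {
      std::vector<StoreDesc> &bucket = entry.second;
      std::sort(bucket.begin(), bucket.end(),
                [](const StoreDesc &a, const StoreDesc &b) {
                  return a.offset < b.offset;
                });
      std::vector<StoreDesc> run{bucket.front()};
      for (std::size_t i = 1; i < bucket.size(); ++i) {
        if (bucket[i].offset == run.back().offset + elemSize) {
          run.push_back(bucket[i]);
        } else {
          if (run.size() > 1)
            runs.push_back(run);
          run = {bucket[i]};
        }
      }
      if (run.size() > 1)
        runs.push_back(run);
    }
    return runs;
  }

This assumes every address is a compile-time constant offset from a shared
base, which does appear to hold for code like the example quoted below,
where each pointer is a getelementptr of an argument with a constant index.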
>>
>> Thanks,
>> Michael
>>
>> > On Jul 7, 2015, at 11:31 AM, Frank Winter <fwinter at jlab.org> wrote:
>> >
>> > Hi all!
>> >
>> > It takes the current SLP vectorizer too long to vectorize my scalar
>> > code. I am talking here about functions that have a single, huge basic
>> > block with O(10^6) instructions. Here's an example:
>> >
>> >  %0 = getelementptr float* %arg1, i32 49
>> >  %1 = load float* %0
>> >  %2 = getelementptr float* %arg1, i32 4145
>> >  %3 = load float* %2
>> >  %4 = getelementptr float* %arg2, i32 49
>> >  %5 = load float* %4
>> >  %6 = getelementptr float* %arg2, i32 4145
>> >  %7 = load float* %6
>> >  %8 = fmul float %7, %1
>> >  %9 = fmul float %5, %3
>> >  %10 = fadd float %9, %8
>> >  %11 = fmul float %7, %3
>> >  %12 = fmul float %5, %1
>> >  %13 = fsub float %12, %11
>> >  %14 = getelementptr float* %arg3, i32 16
>> >  %15 = load float* %14
>> >  %16 = getelementptr float* %arg3, i32 4112
>> >  %17 = load float* %16
>> >  %18 = getelementptr float* %arg4, i32 0
>> >  %19 = load float* %18
>> >  %20 = getelementptr float* %arg4, i32 4096
>> >  %21 = load float* %20
>> >  %22 = fmul float %21, %15
>> >  %23 = fmul float %19, %17
>> >  %24 = fadd float %23, %22
>> >  %25 = fmul float %21, %17
>> >  %26 = fmul float %19, %15
>> >  %27 = fsub float %26, %25
>> >  %28 = fadd float %24, %10
>> >  %29 = fadd float %27, %13
>> >  %30 = getelementptr float* %arg0, i32 0
>> >  store float %29, float* %30
>> >  %31 = getelementptr float* %arg0, i32 4096
>> >  store float %28, float* %31
>> > ... and so on ...
>> >
>> > The SLP vectorizer would create some code like this:
>> >
>> >  %219 = insertelement <4 x float> %218, float %185, i32 2
>> >  %220 = insertelement <4 x float> %219, float %197, i32 3
>> >  %221 = fmul <4 x float> %216, %220
>> >  %222 = fadd <4 x float> %221, %212
>> >  %223 = fmul <4 x float> %207, %220
>> > ..
>> >  %234 = bitcast float* %165 to <4 x float>*
>> >  store <4 x float> %233, <4 x float>* %234, align 4
>> >
>> >
>> > With the current SLP implementation, 99.5% of the compile time is spent
>> > in the SLP vectorizer, and I have the impression that this can be
>> > improved for my case. I believe that the SLP vectorizer has far more
>> > capabilities than I need for these simple (but huge) functions. I was
>> > hoping that some of you might have an idea of how to strip down the SLP
>> > vectorizer so that it can still vectorize these simple functions...?
>> >
>> > Thanks,
>> > Frank
>> >