[LLVMdev] Modifications to SLP

Tue Jul 7 11:55:31 PDT 2015

Hi Frank,

The most time consuming part of SLP vectorizer (especially in cases like yours) is finding sets of consecutive stores. It's currently performed by a quadratic search (see routine SLPVectorizer::vectorizeStores) - we do pairwise comparisons between all pointers (but we do limit ourselves to look at at most 16 stores). I think it should be possible to group pointers with a common base, compute constant relative offset and just sort all of them - this way we’ll save a lot of expensive computations. However, I haven’t tried implementing this, and I guess there might be some hard corner cases too. Patches would be welcome here:)

Thanks,
Michael

> On Jul 7, 2015, at 11:31 AM, Frank Winter <fwinter at jlab.org> wrote:
> 
> Hi all!
> 
> It takes the current SLP vectorizer too long to vectorize my scalar code. I am talking here about functions that have a single, huge basic block with O(10^6) instructions. Here's an example:
> 
>  %0 = getelementptr float* %arg1, i32 49
>  %1 = load float* %0
>  %2 = getelementptr float* %arg1, i32 4145
>  %3 = load float* %2
>  %4 = getelementptr float* %arg2, i32 49
>  %5 = load float* %4
>  %6 = getelementptr float* %arg2, i32 4145
>  %7 = load float* %6
>  %8 = fmul float %7, %1
>  %9 = fmul float %5, %3
>  %10 = fadd float %9, %8
>  %11 = fmul float %7, %3
>  %12 = fmul float %5, %1
>  %13 = fsub float %12, %11
>  %14 = getelementptr float* %arg3, i32 16
>  %15 = load float* %14
>  %16 = getelementptr float* %arg3, i32 4112
>  %17 = load float* %16
>  %18 = getelementptr float* %arg4, i32 0
>  %19 = load float* %18
>  %20 = getelementptr float* %arg4, i32 4096
>  %21 = load float* %20
>  %22 = fmul float %21, %15
>  %23 = fmul float %19, %17
>  %24 = fadd float %23, %22
>  %25 = fmul float %21, %17
>  %26 = fmul float %19, %15
>  %27 = fsub float %26, %25
>  %28 = fadd float %24, %10
>  %29 = fadd float %27, %13
>  %30 = getelementptr float* %arg0, i32 0
>  store float %29, float* %30
>  %31 = getelementptr float* %arg0, i32 4096
>  store float %28, float* %31
> ... and so on ...
> 
> The SLP vectorizer would create some code like this:
> 
>  %219 = insertelement <4 x float> %218, float %185, i32 2
>  %220 = insertelement <4 x float> %219, float %197, i32 3
>  %221 = fmul <4 x float> %216, %220
>  %222 = fadd <4 x float> %221, %212
>  %223 = fmul <4 x float> %207, %220
> ..
>  %234 = bitcast float* %165 to <4 x float>*
>  store <4 x float> %233, <4 x float>* %234, align 4
> 
> 
> With the current SLP implementation 99.5% of the time is spent in the SLP vectorizer and I have the impression that this can be improved for my case. I believe that the SLP vectorizer has far more capabilities than I would need for these simple (but huge) functions. And I was hoping that any of you have an idea how to remove functionality of the SLP vectorizer such that it still can vectorize those simple functions...?
> 
> Thanks,
> Frank
> 
> _______________________________________________
> LLVM Developers mailing list
> LLVMdev at cs.uiuc.edu         http://llvm.cs.uiuc.edu
> http://lists.cs.uiuc.edu/mailman/listinfo/llvmdev