[LLVMdev] Replacing a repetitive sequence of code with a loop

Wed Jun 3 11:57:27 PDT 2015

> On 03.06.2015, at 19:37, Frank Winter <fwinter at jlab.org> wrote:
> 
> Hey guys, in an HPC project I am working on, I am given an LLVM program consisting of a linear sequence of repetitive chunks of code with a uniform memory access pattern. Each chunk does the following: 1) loads some memory, 2) performs some arithmetic operations, 3) stores the result back to memory. The memory stride between consecutive chunks is constant over the whole program, so the whole program could be replaced with a single loop whose body contains a generic version of the chunk. Here's an example (a short one; the real-world program would be much longer):
> 
> define void @vec_plus_vec(float* noalias %arg0, float* noalias %arg1, float* noalias %arg2) {
> entrypoint:
>  %0 = bitcast float* %arg1 to <4 x float>*
>  %1 = bitcast float* %arg2 to <4 x float>*
>  %2 = load <4 x float>* %0, align 16
>  %3 = load <4 x float>* %1, align 16
>  %4 = fadd <4 x float> %3, %2
>  %5 = bitcast float* %arg0 to <4 x float>*
>  store <4 x float> %4, <4 x float>* %5, align 16
>  %6 = getelementptr float* %arg1, i64 4
>  %7 = getelementptr float* %arg2, i64 4
>  %8 = getelementptr float* %arg0, i64 4
>  %9 = bitcast float* %6 to <4 x float>*
>  %10 = bitcast float* %7 to <4 x float>*
>  %11 = load <4 x float>* %9, align 16
>  %12 = load <4 x float>* %10, align 16
>  %13 = fadd <4 x float> %12, %11
>  %14 = bitcast float* %8 to <4 x float>*
>  store <4 x float> %13, <4 x float>* %14, align 16
>  %15 = getelementptr float* %arg1, i64 8
>  %16 = getelementptr float* %arg2, i64 8
>  %17 = getelementptr float* %arg0, i64 8
>  %18 = bitcast float* %15 to <4 x float>*
>  %19 = bitcast float* %16 to <4 x float>*
>  %20 = load <4 x float>* %18, align 16
>  %21 = load <4 x float>* %19, align 16
>  %22 = fadd <4 x float> %21, %20
>  %23 = bitcast float* %17 to <4 x float>*
>  store <4 x float> %22, <4 x float>* %23, align 16
>  %24 = getelementptr float* %arg1, i64 12
>  %25 = getelementptr float* %arg2, i64 12
>  %26 = getelementptr float* %arg0, i64 12
>  %27 = bitcast float* %24 to <4 x float>*
>  %28 = bitcast float* %25 to <4 x float>*
>  %29 = load <4 x float>* %27, align 16
>  %30 = load <4 x float>* %28, align 16
>  %31 = fadd <4 x float> %30, %29
>  %32 = bitcast float* %26 to <4 x float>*
>  store <4 x float> %31, <4 x float>* %32, align 16
>  ret void
> }
> 
> The stride between two consecutive chunks is 4*sizeof(float), so the sequence could be condensed into a single loop. (The scenario with a constant stride between repetitive chunks is the first, simplest version. Later I will have to deal with arbitrary strides between chunks, which I guess requires an index array stored in the code as static data.)
> 
> For this simple scenario, which existing LLVM pass would be closest to what I am trying to achieve?

There's a loop reroll pass in LLVM trunk that should do exactly this transformation.
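
For reference, the rolled form of the quoted example would look roughly like the sketch below. It is a hand-written illustration in the same IR syntax as the quoted function, not output from the reroll pass. The value names (%i, %i.next, %done and so on) and the block labels loop and exit are made up for the example, and the bound of 16 floats simply matches the four chunks shown.

```llvm
define void @vec_plus_vec(float* noalias %arg0, float* noalias %arg1, float* noalias %arg2) {
entrypoint:
  br label %loop

loop:
  ; %i is the float offset of the current chunk: 0, 4, 8, 12
  %i = phi i64 [ 0, %entrypoint ], [ %i.next, %loop ]
  %p1 = getelementptr float* %arg1, i64 %i
  %p2 = getelementptr float* %arg2, i64 %i
  %p0 = getelementptr float* %arg0, i64 %i
  %v1 = bitcast float* %p1 to <4 x float>*
  %v2 = bitcast float* %p2 to <4 x float>*
  %a = load <4 x float>* %v1, align 16
  %b = load <4 x float>* %v2, align 16
  %sum = fadd <4 x float> %b, %a
  %v0 = bitcast float* %p0 to <4 x float>*
  store <4 x float> %sum, <4 x float>* %v0, align 16
  ; advance by the constant stride of 4 floats and stop after 4 chunks
  %i.next = add nuw nsw i64 %i, 4
  %done = icmp eq i64 %i.next, 16
  br i1 %done, label %exit, label %loop

exit:
  ret void
}
```

One caveat, as far as I know: the reroll pass looks for repeated groups inside the body of an existing loop, so purely straight-line code like the quoted function may first need to be wrapped in a loop before the pass can apply.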

- Ben