[LLVMdev] Vectorization of loops with conditional dereferencing

Renato Golin renato.golin at linaro.org
Thu Nov 14 02:18:04 PST 2013


On 1 November 2013 13:40, Hal Finkel <hfinkel at anl.gov> wrote:

> Done = false;
> FirstI = -1, LastI = -1;
> while (!Done) {
>   for (I = FirstI+1; I < N; ++I)
>     if (r[i] > 0) {
>       FirstI = I;
>       break;
>     }
>
>   for (; I < N && !page_bound(&m[i]) && ...; ++I) {
>     if (r[i] > 0)
>       LastI = I;
>   }
>
>   Done = I == N;
>
>   for (I = FirstI; I <= LastI; ...) {
>     // Run the scalar/vector loop sequence as before.
>   }
> }
>

Hi Hal,

Even if you could do something like this, there is no guarantee that (Last
- First) will be multiple of VF, or even bigger than VF, and you'll have to
cope with boundary conditions on every external loop, which, depending on
the distribution of r[] could be as many as half the size of r[] itself.

I can't see an algorithmically (compile-time or run-time) way to guarantee
that the number of clusters will be small without scanning the whole array.
So, even if you add such a check in run-time and only vectorize if the
number of clusters is small, the worst case scenario will run twice as many
times, (the initial check can be vectorized, as it's a read-only loop), but
the later cannot. in the best case scenario, you'll have only a handful of
loops, which will run in parallel.

Worst case: n/VF + n
Best case: n/VF + ammortized n/VF

For VF == 2,
 * best case is as fast as scalar code, considering overheads.
 * worst case is 50% slower

For VF == 4,
 * best case is 50% faster than scalar code
 * worst case is 25% slower

And all that, depends on each workload, so it'll change for every different
set of arguments, which in an n-body simulation, changes dynamically.

cheers,
--renato
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.llvm.org/pipermail/llvm-dev/attachments/20131114/1cbbfbff/attachment.html>


More information about the llvm-dev mailing list