[LLVMdev] Vectorization of loops with conditional dereferencing
Renato Golin
renato.golin at linaro.org
Thu Nov 14 02:18:04 PST 2013
On 1 November 2013 13:40, Hal Finkel <hfinkel at anl.gov> wrote:
> Done = false;
> FirstI = -1, LastI = -1;
> while (!Done) {
> for (I = FirstI+1; I < N; ++I)
> if (r[i] > 0) {
> FirstI = I;
> break;
> }
>
> for (; I < N && !page_bound(&m[i]) && ...; ++I) {
> if (r[i] > 0)
> LastI = I;
> }
>
> Done = I == N;
>
> for (I = FirstI; I <= LastI; ...) {
> // Run the scalar/vector loop sequence as before.
> }
> }
>
Hi Hal,
Even if you could do something like this, there is no guarantee that (Last
- First) will be multiple of VF, or even bigger than VF, and you'll have to
cope with boundary conditions on every external loop, which, depending on
the distribution of r[] could be as many as half the size of r[] itself.
I can't see an algorithmically (compile-time or run-time) way to guarantee
that the number of clusters will be small without scanning the whole array.
So, even if you add such a check in run-time and only vectorize if the
number of clusters is small, the worst case scenario will run twice as many
times, (the initial check can be vectorized, as it's a read-only loop), but
the later cannot. in the best case scenario, you'll have only a handful of
loops, which will run in parallel.
Worst case: n/VF + n
Best case: n/VF + ammortized n/VF
For VF == 2,
* best case is as fast as scalar code, considering overheads.
* worst case is 50% slower
For VF == 4,
* best case is 50% faster than scalar code
* worst case is 25% slower
And all that, depends on each workload, so it'll change for every different
set of arguments, which in an n-body simulation, changes dynamically.
cheers,
--renato
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.llvm.org/pipermail/llvm-dev/attachments/20131114/1cbbfbff/attachment.html>
More information about the llvm-dev
mailing list