[LLVMdev] Vectorization of loops with conditional dereferencing

Thu Nov 14 00:16:12 PST 2013

Hi Hal,

I may have missed the point here, but what about the "naive" solution of 
using a cascade of guarded, scalar loads in the vectorized code?

for (int i = 0; i < N; i+=4) {
   v0 = ...

   mask cond = r[i] > 0;
   vector l;
   if (mask[0])
     l[0] = m[0];
   if (mask[1])
     l[0] = m[1];
   ...

   v1 = v0 + l*...;
   blend(cond, v0, v1);
}

(I used array notation for accessing vector elements)

The rest of the code can remain vectorized (e.g. the += and *).
Or is it the case that the load is exactly the instruction you want 
vectorized because it dominates the runtime of the loop?

Oh, and sorry for the late reply, I don't have the time to check LLVMdev 
regularly atm :).

Cheers,
Ralf

On 01/11/13 14:40, Hal Finkel wrote:
> Nadav, Arnold, et al.,
>
> I have a number of loops that I would like us to be able to autovectorize (common, for example, in n-body inter-particle force kernels), and the problem is that they look like this:
>
> for (int i = 0; i < N; ++i) {
>    if (r[i] > 0)
>      v += m[i]*...;
> }
>
> where, as written, m[i] is not accessed unless the condition is true. The general problem (as is noted by the loop vectorizer), is that we don't know that dereferencing m[i] is safe for values where the guarding condition is not true, and so if-converting the block directly is not allowed.
>
> One possibility, if we disallow for m[i] to have been munmaped or page protected in the middle, is to first collect the first and last index for which the condition is true, and then run the vector loop only on that portion. For the loop above, I mean something like:
>
> FirstI = -1, LastI = -1, I = 0;
> for (; I < N; ++I)
>    if (r[i] > 0) {
>      FirstI = I;
>      break;
>    }
>
> for (; I < N; ++I) {
>    if (r[i] > 0)
>      LastI = I;
> }
>
> [In this case, since the guard dominates all side-effect, we don't need to do anything outside that range, otherwise, we'd need to run the scalar loop until FirstI, and then again from LastI until N]
>
> for (I = FirstI; I <= LastI; ...) {
>    // Run the scalar/vector loop sequence as before.
> }
>
> On the other hand, if we can't make this range assumption generally, then I'd assume that we could make it per-page (where a page is 4kB or similar). Then the scheme becomes more complicated, because we need to break the ranges on page boundaries (maybe something like this):
>
> Done = false;
> FirstI = -1, LastI = -1;
> while (!Done) {
>    for (I = FirstI+1; I < N; ++I)
>      if (r[i] > 0) {
>        FirstI = I;
>        break;
>      }
>
>    for (; I < N && !page_bound(&m[i]) && ...; ++I) {
>      if (r[i] > 0)
>        LastI = I;
>    }
>
>    Done = I == N;
>
>    for (I = FirstI; I <= LastI; ...) {
>      // Run the scalar/vector loop sequence as before.
>    }
> }
>
> I hope that we don't need to do something like this, but I've been disappointed by C/POSIX semantics before. On the other hand, we might want to break these into groups anyway, and allocate a local stack buffer to hold the mask computed from the condition, so that we don't have to reevaluate the condition in the loop.
>
> Also, I'm certainly open to other ways to do this. Are there any other ways? Thoughts?
>
> Thanks again,
> Hal
>