[LLVMdev] Vectorization of loops with conditional dereferencing

Fri Nov 1 09:38:41 PDT 2013

Hi Hal,

Yes, I agree that this is a problem that prevents vectorization in many loops.  Another problem that we have is that sunk loads don’t preserve their control dependence properties.  For example in the code below, if we sink the load into the branch then we can't vectorize the loop.  

x = A[i]
if (cond) {
	sum += x;	
}

I agree with you that checking the first and last element for each 4k page has many disadvantages and I would like to explore other options. Yesterday Pekka proposed marking some operations as 'no-trap'.  Maybe we should consider marking some memory regions as ‘mapped'.  Every time we malloc memory or pass a pointer to statically allocated memory we could propagate the information about the availability and size of the memory.  What do you think ?

Thanks,
Nadav 

On Nov 1, 2013, at 6:40 AM, Hal Finkel <hfinkel at anl.gov> wrote:

> Nadav, Arnold, et al.,
> 
> I have a number of loops that I would like us to be able to autovectorize (common, for example, in n-body inter-particle force kernels), and the problem is that they look like this:
> 
> for (int i = 0; i < N; ++i) {
>  if (r[i] > 0)
>    v += m[i]*...;
> }
> 
> where, as written, m[i] is not accessed unless the condition is true. The general problem (as is noted by the loop vectorizer), is that we don't know that dereferencing m[i] is safe for values where the guarding condition is not true, and so if-converting the block directly is not allowed.
> 
> One possibility, if we disallow for m[i] to have been munmaped or page protected in the middle, is to first collect the first and last index for which the condition is true, and then run the vector loop only on that portion. For the loop above, I mean something like:
> 
> FirstI = -1, LastI = -1, I = 0;
> for (; I < N; ++I)
>  if (r[i] > 0) {
>    FirstI = I;
>    break;
>  }
> 
> for (; I < N; ++I) {
>  if (r[i] > 0)
>    LastI = I;
> }
> 
> [In this case, since the guard dominates all side-effect, we don't need to do anything outside that range, otherwise, we'd need to run the scalar loop until FirstI, and then again from LastI until N]
> 
> for (I = FirstI; I <= LastI; ...) {
>  // Run the scalar/vector loop sequence as before.
> }
> 
> On the other hand, if we can't make this range assumption generally, then I'd assume that we could make it per-page (where a page is 4kB or similar). Then the scheme becomes more complicated, because we need to break the ranges on page boundaries (maybe something like this):
> 
> Done = false;
> FirstI = -1, LastI = -1;
> while (!Done) {
>  for (I = FirstI+1; I < N; ++I)
>    if (r[i] > 0) {
>      FirstI = I;
>      break;
>    }
> 
>  for (; I < N && !page_bound(&m[i]) && ...; ++I) {
>    if (r[i] > 0)
>      LastI = I;
>  }
> 
>  Done = I == N;
> 
>  for (I = FirstI; I <= LastI; ...) {
>    // Run the scalar/vector loop sequence as before.
>  }  
> }
> 
> I hope that we don't need to do something like this, but I've been disappointed by C/POSIX semantics before. On the other hand, we might want to break these into groups anyway, and allocate a local stack buffer to hold the mask computed from the condition, so that we don't have to reevaluate the condition in the loop.
> 
> Also, I'm certainly open to other ways to do this. Are there any other ways? Thoughts?
> 
> Thanks again,
> Hal
> 
> -- 
> Hal Finkel
> Assistant Computational Scientist
> Leadership Computing Facility
> Argonne National Laboratory