[LLVMdev] First attempt at recognizing pointer reduction

Arnold Schwaighofer aschwaighofer at apple.com
Mon Oct 21 12:58:55 PDT 2013


On Oct 21, 2013, at 1:00 PM, Renato Golin <renato.golin at linaro.org> wrote:

> Hi Arnold,
> 
> To sum up my intentions, I want to understand how the reduction/induction variable detection works in LLVM, so that I can know better how to detect different patterns in memory, not just the stride vectorization. 

To detect memory access patterns you will want to look at the SCEV of a pointer (or at a set of SCEVs/pointers). This way you get a canonical form.

For example, these should be the SCEVs of the two stores in “a[2*i] = …; a[2*i+1] = …” (int elements):

{ptr,   +, 8}_loop
{ptr+4, +, 8}_loop

Each access on its own requires a gather/scatter (2 loads/stores when vectorized at VF=2, plus inserts/extracts). But when we look at both at once, we see that we only need two loads/stores in total (plus some interleaving operations).
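
By the way, here is a minimal sketch (my own illustration with placeholder names, not existing vectorizer code) of how you could query ScalarEvolution for that canonical form and pull out a constant stride in bytes:

  #include "llvm/Analysis/LoopInfo.h"
  #include "llvm/Analysis/ScalarEvolution.h"
  #include "llvm/Analysis/ScalarEvolutionExpressions.h"
  using namespace llvm;

  // Returns true and sets Stride (in bytes) if Ptr is an affine add-recurrence
  // {Start, +, Stride} in loop L, e.g. {ptr, +, 8} for the a[2*i] access above.
  static bool getConstantByteStride(ScalarEvolution &SE, Value *Ptr,
                                    const Loop *L, int64_t &Stride) {
    const SCEV *S = SE.getSCEV(Ptr);
    const SCEVAddRecExpr *AR = dyn_cast<SCEVAddRecExpr>(S);
    if (!AR || AR->getLoop() != L || !AR->isAffine())
      return false;
    const SCEVConstant *Step =
        dyn_cast<SCEVConstant>(AR->getStepRecurrence(SE));
    if (!Step)
      return false;
    Stride = Step->getValue()->getSExtValue();
    return true;
  }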

What other patterns (than strided accesses) do you want to vectorize?

> 
> For instance, even if the relationship between each loop would be complicated, I know that in each loop, all three reads are sequential. So, at least, I could use a Load<3> rather than three loads.


Yes. Detecting this is what is described in the paper I referred to in an earlier post (Auto-vectorization of interleaved data for SIMD), and what gcc is doing (AFAICT). You want to take a set of accesses in the loop and recognize that they are consecutive in memory (currently we look at every access on its own; if an access is not sequential, we assume we have to gather/scatter). Once you know that you have consecutive accesses spread across different instructions, you can generate more efficient code instead of scatters/gathers. I think you would want to do the evaluation of which accesses are consecutive at the SCEV level.
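
For example, a rough SCEV-level check could look like this (just a sketch, the helper is made up for illustration and is not an existing API): two affine pointer recurrences belong to the same interleaved group if they run over the same loop with the same step and their starts differ by exactly one element:

  // (same includes as the sketch above)
  static bool areConsecutiveAccesses(ScalarEvolution &SE,
                                     const SCEVAddRecExpr *A,
                                     const SCEVAddRecExpr *B,
                                     int64_t ElemSizeInBytes) {
    if (A->getLoop() != B->getLoop() || !A->isAffine() || !B->isAffine())
      return false;
    // SCEVs are uniqued, so pointer equality compares the step expressions.
    if (A->getStepRecurrence(SE) != B->getStepRecurrence(SE))
      return false;
    // E.g. {ptr, +, 8} and {ptr+4, +, 8}: starts differ by one 4-byte element.
    const SCEVConstant *Diff =
        dyn_cast<SCEVConstant>(SE.getMinusSCEV(B->getStart(), A->getStart()));
    return Diff && Diff->getValue()->getSExtValue() == ElemSizeInBytes;
  }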

For your example, you first want to recognize strided accesses (this is separate from the induction/reduction mechanism):

for (i = 0 .. n, +1)
 a[2*i] = …
 a[2*i+1] = …

You want this part first because without it the loop vectorizer is not going to vectorize, due to the cost of gather/scatter. But if it can recognize that in cases like this the “gather/scatter” is not as costly, and we emit the right code, we will start to vectorize such loops.
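
To make the “right code” concrete, here is a hand-written sketch (not actual vectorizer output; the x and y sources are just placeholders, and I assume int elements) of what VF=2 could look like for the loop above. The point is that the four stores per vector iteration cover the contiguous range a[2*i .. 2*i+3], so they can be emitted as one wide store plus an interleaving shuffle rather than two scatters:

  void store_interleaved_vf2(int *a, const int *x, const int *y, int n) {
    int i = 0;
    for (; i + 1 < n; i += 2) {
      int v0 = x[i],     v1 = y[i];      // values for iteration i
      int v2 = x[i + 1], v3 = y[i + 1];  // values for iteration i + 1
      // Contiguous block a[2*i .. 2*i+3]: conceptually a single wide store
      // of the interleaved vector {v0, v1, v2, v3}.
      a[2 * i]     = v0;
      a[2 * i + 1] = v1;
      a[2 * i + 2] = v2;
      a[2 * i + 3] = v3;
    }
    for (; i < n; ++i) {                 // scalar remainder
      a[2 * i]     = x[i];
      a[2 * i + 1] = y[i];
    }
  }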

Once we can do that, we need to support strided pointer inductions to get your example.

for (i = 0..n, +1)
 *a++=
 *a++=


> I guess this is why Nadav was hinting for the SLP vectorizer, not the loop vec. Since the operation on all three loaded values are exactly the same AND the writes are also sequential, I can reduce that to: load<3> -> op<3> ->
> store<3>. That should work even on machines that don't have de-interleaved access (assuming they're pointers to simple types, etc).

Getting this example in the SLP vectorizer is easier, but we won’t get the most efficient code (i.e. the one that gcc emits) because we would have <3 x i8> loads/stores. With vectorization of interleaved data you can load/store more elements (from several iterations) with a single load.
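
To put rough numbers on it (assuming i8 elements, as in the <3 x i8> case above): the SLP form moves 3 bytes per vector load/store, one iteration at a time, while an interleaved loop-vectorized version at, say, VF=8 touches 24 consecutive bytes covering 8 iterations, which can be done with a couple of wide loads/stores plus de-interleaving shuffles.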

> 
> 
> On 21 October 2013 17:29, Arnold Schwaighofer <aschwaighofer at apple.com> wrote:
> can you post a hand-created vectorized IR of how a reduction would work on your example?
> 
> I'm not there yet, just trying to understand the induction/reduction code first and what comes out of it.
> 
> 
> I think the right approach to get examples like this is that we need to recognize strided pointer inductions (we only support strides by one currently).
> 
> I see. I'll have a look at IK_PtrInduction and see what patterns I can spot.
> 
> Do you envisage a new IK type for strided induction, or just work with the PtrInduction to extend the concept of a non-unit stride (ie. separate Step from ElementSize)?

Either representation should be fine. The bigger task, though, is not recognizing the induction but recognizing consecutive strided memory accesses. First, I think we want to be able to do:

for (i = 0 … n, +1)
  a[3*i] =
  a[3*i+1] =
  a[3*i+2] =

And next,

for (i = 0 … n, +1)
  *a++ =
  *a++ =
  *a++ =
  
Because to get the latter, you need the former.
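
To see why (assuming 4-byte elements), both loops produce the same three access expressions,

{a,   +, 12}_loop
{a+4, +, 12}_loop
{a+8, +, 12}_loop

the only difference being that in the second loop the induction variable is the pointer itself, stepping by three elements per iteration, rather than the integer i.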


Have you compared the performance of the kernel (gcc vectorized) you showed vs a scalar version? I would be curious about the speed-up.


Thanks,
Arnold


