[LLVMdev] First attempt at recognizing pointer reduction

Arnold Schwaighofer aschwaighofer at apple.com
Mon Oct 21 13:51:53 PDT 2013


Whenever you find time to work on this will be of great help!

I wanted to work on this at some point. But my near time tasks won’t allow. So I much appreciate any help!


Thanks,
Arnold

On Oct 21, 2013, at 3:40 PM, Renato Golin <renato.golin at linaro.org> wrote:

> On 21 October 2013 20:58, Arnold Schwaighofer <aschwaighofer at apple.com> wrote:
> For example these should be the SCEVs of “int a[2*i] = ; a[2*i+1] =”:
> 
> {ptr,   +, 8}_loop
> {ptr+4, +, 8}_loop
> 
> Each access on its own requires a gather/scather (2 loads/stores when vectorized (VF=2) + inserts/extracts). But when we look at both at once we see that we only need two load/store in total (plus some interleaving operations).
> 
> 
> Yes, I've been studying SCEV when trying to understand some other patterns where the vectorizer was unable to detect the exit count (basically this case, with a nested loop). It does make things easier to spot patterns in the code.
> 
> The patch I attached here was not to help vectorize anything, but to let me jump over the validation step, so that I could start working with the patterns themselves during the actual vectorization. The review request was only to understand if the checks I was making made sense, but it turned out a lot more valuable than that.
> 
> 
> Getting this example in the slp vectorizer is easier but we won’t get the most efficient code (i.e. the one that gcc emits) because we would have <3 x i8> stores/loads. With vectorization of interleaved data you can load/store more elements (from several iterations) with a single load.
> 
> So, this was the other patterns I was looking for, as a stepping stone into the full vectorizer. But I'm not sure this will help in any way the strided access.
> 
> 
> 
> Either representation should be fine. I think the bigger task is not recognizing the induction but recognizing consecutive strided memory accesses, though. First, I think we want to be able to do:
> 
> for (i = 0 … n, +1)
>   a[3*i] =
>   a[3*i+1] =
>   a[3*i+2] =
> 
> And next,
> 
> for (i = 0 … n, +1)
>   *a++ =
>   *a++ =
>   *a++ =
> 
> Because to get the latter, you need the former.
> 
> Makes total sense. I'll change my approach.
> 
> 
> Have you compared the performance of the kernel (gcc vectorized) you showed vs a scalar version? I would be curious about the speed-up.
> 
> 4x faster, on both Cortex A9 and A15. :)

Nice.

> 
> Thanks for the tips, I hope I can find more time to work on it this week, since Linaro Connect is in the coming week and the US dev meeting is on the next.
> 
> cheers,
> --renato





More information about the llvm-dev mailing list