[LLVMdev] First attempt at recognizing pointer reduction

Mon Oct 21 13:40:38 PDT 2013

On 21 October 2013 20:58, Arnold Schwaighofer <aschwaighofer at apple.com> wrote:

> For example these should be the SCEVs of “int a[2*i] = ; a[2*i+1] =”:
>
> {ptr,   +, 8}_loop
> {ptr+4, +, 8}_loop
>
> Each access on its own requires a gather/scatter (2 loads/stores when
> vectorized (VF=2) + inserts/extracts). But when we look at both at once we
> see that we only need two loads/stores in total (plus some interleaving
> operations).
>
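
[Illustration: a minimal C sketch of the loop behind those two SCEVs,
assuming 4-byte int elements so that a stride of two elements is 8 bytes.
The names store_pairs, even_offset and odd_offset are hypothetical, made up
for this example; they are not LLVM APIs.]

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Scalar loop with two stores per iteration, at a[2*i] and a[2*i+1]. */
void store_pairs(int32_t *a, size_t n, int32_t even, int32_t odd) {
    assert(a != NULL);
    for (size_t i = 0; i < n; ++i) {
        a[2 * i]     = even;  /* address: ptr + 0 + 8*i, i.e. {ptr,   +, 8} */
        a[2 * i + 1] = odd;   /* address: ptr + 4 + 8*i, i.e. {ptr+4, +, 8} */
    }
}

/* Byte offset from the base pointer of each access at iteration i. */
size_t even_offset(size_t i) { return 0 + 8 * i; }
size_t odd_offset(size_t i)  { return 4 + 8 * i; }
```

Taken together, the two recurrences cover every byte from ptr to ptr + 8*n
with no gaps, which is why looking at both accesses at once allows plain
contiguous stores plus interleaving.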

Yes, I've been studying SCEV when trying to understand some other patterns
where the vectorizer was unable to detect the exit count (basically this
case, with a nested loop). It does make things easier to spot patterns in
the code.

The patch I attached here was not to help vectorize anything, but to let me
jump over the validation step, so that I could start working with the
patterns themselves during the actual vectorization. The review request was
only to understand if the checks I was making made sense, but it turned out
to be a lot more valuable than that.

> Getting this example in the slp vectorizer is easier but we won’t get the
> most efficient code (i.e. the one that gcc emits) because we would have <3
> x i8> stores/loads. With vectorization of interleaved data you can
> load/store more elements (from several iterations) with a single load.
>
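
[Illustration: a rough scalar C model of an interleaved load, assuming three
i8 fields per iteration and four iterations per chunk. STRIDE, VF and
deinterleave3 are hypothetical names chosen for this sketch; the memcpy
stands in for the single wide load and the loop for the shuffles a vector
target would use.]

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

enum { STRIDE = 3, VF = 4 };  /* assumed: 3 fields, 4 iterations per chunk */

/* Read STRIDE*VF contiguous bytes once, then split them into one lane
   array per field, so each field holds values from VF iterations. */
void deinterleave3(const uint8_t *src,
                   uint8_t f0[VF], uint8_t f1[VF], uint8_t f2[VF]) {
    uint8_t chunk[STRIDE * VF];
    memcpy(chunk, src, sizeof chunk);      /* models the one wide load */
    for (size_t lane = 0; lane < VF; ++lane) {
        f0[lane] = chunk[STRIDE * lane];       /* a[3*i]   */
        f1[lane] = chunk[STRIDE * lane + 1];   /* a[3*i+1] */
        f2[lane] = chunk[STRIDE * lane + 2];   /* a[3*i+2] */
    }
}
```

Here twelve bytes from four iterations are fetched together, instead of
three bytes per iteration as in the <3 x i8> case.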

So, these were the other patterns I was looking for, as a stepping stone
into the full vectorizer. But I'm not sure this will help with strided
access in any way.

> Either representation should be fine. I think the bigger task is not
> recognizing the induction but recognizing consecutive strided memory
> accesses, though. First, I think we want to be able to do:
>
> for (i = 0 … n, +1)
>   a[3*i] =
>   a[3*i+1] =
>   a[3*i+2] =
>
> And next,
>
> for (i = 0 … n, +1)
>   *a++ =
>   *a++ =
>   *a++ =
>
> Because to get the latter, you need the former.
>
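
[Illustration: a small C sketch of the two loop forms, assuming byte-sized
elements and three stored values per iteration. fill_indexed, fill_bumped
and forms_agree are hypothetical names for this example only.]

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Indexed form: every address is an explicit function of i. */
void fill_indexed(uint8_t *a, size_t n, uint8_t r, uint8_t g, uint8_t b) {
    for (size_t i = 0; i < n; ++i) {
        a[3 * i]     = r;
        a[3 * i + 1] = g;
        a[3 * i + 2] = b;
    }
}

/* Pointer-bump form: the pointer itself advances by 3 per iteration. */
void fill_bumped(uint8_t *a, size_t n, uint8_t r, uint8_t g, uint8_t b) {
    for (size_t i = 0; i < n; ++i) {
        *a++ = r;
        *a++ = g;
        *a++ = b;
    }
}

/* Returns 1 when both forms write identical bytes for a 4-iteration run. */
int forms_agree(void) {
    uint8_t x[12] = {0}, y[12] = {0};
    fill_indexed(x, 4, 1, 2, 3);
    fill_bumped(y, 4, 1, 2, 3);
    return memcmp(x, y, sizeof x) == 0;
}
```

Both loops touch the same addresses in the same order, so once the pointer
updates are rewritten as offsets from the base, the second form reduces to
the first.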

Makes total sense. I'll change my approach.

> Have you compared the performance of the kernel (gcc vectorized) you showed
> vs a scalar version? I would be curious about the speed-up.
>

4x faster, on both Cortex A9 and A15. :)

Thanks for the tips. I hope I can find more time to work on it this week,
since Linaro Connect is next week and the US dev meeting is the week after.

cheers,
--renato