[LLVMdev] First attempt at recognizing pointer reduction
Renato Golin
renato.golin at linaro.org
Mon Oct 21 13:40:38 PDT 2013
On 21 October 2013 20:58, Arnold Schwaighofer <aschwaighofer at apple.com>wrote:
> For example these should be the SCEVs of “int a[2*i] = ; a[2*i+1] =”:
>
> {ptr, +, 8}_loop
> {ptr+4, +, 8}_loop
>
> Each access on its own requires a gather/scather (2 loads/stores when
> vectorized (VF=2) + inserts/extracts). But when we look at both at once we
> see that we only need two load/store in total (plus some interleaving
> operations).
>
Yes, I've been studying SCEV when trying to understand some other patterns
where the vectorizer was unable to detect the exit count (basically this
case, with a nested loop). It does make things easier to spot patterns in
the code.
The patch I attached here was not to help vectorize anything, but to let me
jump over the validation step, so that I could start working with the
patterns themselves during the actual vectorization. The review request was
only to understand if the checks I was making made sense, but it turned out
a lot more valuable than that.
Getting this example in the slp vectorizer is easier but we won’t get the
> most efficient code (i.e. the one that gcc emits) because we would have <3
> x i8> stores/loads. With vectorization of interleaved data you can
> load/store more elements (from several iterations) with a single load.
>
So, this was the other patterns I was looking for, as a stepping stone into
the full vectorizer. But I'm not sure this will help in any way the strided
access.
Either representation should be fine. I think the bigger task is not
> recognizing the induction but recognizing consecutive strided memory
> accesses, though. First, I think we want to be able to do:
>
> for (i = 0 … n, +1)
> a[3*i] =
> a[3*i+1] =
> a[3*i+2] =
>
> And next,
>
> for (i = 0 … n, +1)
> *a++ =
> *a++ =
> *a++ =
>
> Because to get the latter, you need the former.
>
Makes total sense. I'll change my approach.
Have you compared the performance of the kernel (gcc vectorized) you showed
> vs a scalar version? I would be curious about the speed-up.
>
4x faster, on both Cortex A9 and A15. :)
Thanks for the tips, I hope I can find more time to work on it this week,
since Linaro Connect is in the coming week and the US dev meeting is on the
next.
cheers,
--renato
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.llvm.org/pipermail/llvm-dev/attachments/20131021/c0d72575/attachment.html>
More information about the llvm-dev
mailing list