[llvm-dev] enabling interleaved access loop vectorization
Michael Kuperstein via llvm-dev
llvm-dev at lists.llvm.org
Fri Aug 5 10:05:10 PDT 2016
Regarding InterleavedAccessPass - sure, but proper strided/interleaved
access optimization ought to have a positive impact even without target
support.
Case in point - Hal enabled it on PPC last September. An important
difference vs. x86 seems to be that arbitrary shuffles are cheap on PPC,
but, as I said below, I hope we can enable it on x86 with a conservative
cost function and still see an improvement.
On Fri, Aug 5, 2016 at 7:02 AM, Matthew Simpson <mssimpso at codeaurora.org>
wrote:
> Isn't our current interleaved access vectorization just a special case of
> the more general strided access proposal? If so, from a development
> perspective, it might make sense to begin incorporating some of that work
> into the existing framework (with appropriate target hooks and costs). This
> could probably be done piecemeal rather than all at once.
>
>
>
> Also, keep in mind that ARM/AArch64 run an additional IR pass
> (InterleavedAccessPass) that matches the load/store plus shuffle sequences
> that the vectorizer generates to target-specific intrinsics.
>
>
>
> -- Matt
>
>
>
>
>
> *From:* Nema, Ashutosh [mailto:Ashutosh.Nema at amd.com]
> *Sent:* Friday, August 05, 2016 7:21 AM
> *To:* Michael Kuperstein <mkuper at google.com>; Demikhovsky, Elena <
> elena.demikhovsky at intel.com>
> *Cc:* Renato Golin <renato.golin at linaro.org>; Sanjay Patel <
> spatel at rotateright.com>; Matthew Simpson <mssimpso at codeaurora.org>;
> llvm-dev <llvm-dev at lists.llvm.org>
> *Subject:* RE: [llvm-dev] enabling interleaved access loop vectorization
>
>
>
> Hi Michael,
>
>
>
> Some time back I did some experiments with the interleaved vectorizer and
> did not find any degradation; probably my tests/benchmarks are not
> extensive enough to cover many cases.
>
>
>
> Elena is the right person to comment on this, as she has already seen
> cases where it hinders performance.
>
>
>
> For the interleaved vectorizer on X86 we do not have any target-specific
> cost model; it falls back to BasicTTI, where the costing is not appropriate
> for X86.
> 
> It charges the cost of extracts & inserts for extracting elements from a
> wide vector, which is really expensive.
> 
> E.g., in your test case the cost of the load associated with “in[i * 2]” is
> 10 (for VF = 4).
>
> The interleaved vectorizer will generate the following instructions for it:
>
> %wide.vec = load <8 x i32>, <8 x i32>* %14, align 4, !tbaa !1,
> !alias.scope !5
>
> %strided.vec = shufflevector <8 x i32> %wide.vec, <8 x i32> undef, <4 x
> i32> <i32 0, i32 2, i32 4, i32 6>
>
>
>
> For the wide load it gets a cost of 2 (as it has to generate 2 loads), but
> for extracting the elements (the shuffle operation) it gets a cost of 8
> (4 for extracts + 4 for inserts).
> 
> The cost should be 3 here: 2 for the loads & 1 for the shuffle.
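> 
> To make the arithmetic concrete, here is a tiny standalone model of the two
> estimates (a sketch using only the numbers from this thread, not the real
> TTI code):
> 
> #include <cstdio>
> 
> // BasicTTI today: per-element extracts + inserts on top of the wide loads.
> static int basicTTIEstimate(int VF) {
>   int WideLoads = 2;  // <8 x i32> legalized into two 128-bit loads
>   int Extracts = VF;  // 4 extracts for VF = 4
>   int Inserts = VF;   // 4 inserts for VF = 4
>   return WideLoads + Extracts + Inserts;  // = 10
> }
> 
> // An x86-aware estimate: the de-interleave is a single real shuffle.
> static int x86AwareEstimate(void) {
>   return 2 /* loads */ + 1 /* shuffle */;  // = 3
> }
> 
> int main(void) {
>   printf("BasicTTI estimate (VF = 4): %d\n", basicTTIEstimate(4));
>   printf("x86-aware estimate:         %d\n", x86AwareEstimate());
>   return 0;
> }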
>
>
>
> To enable the interleaved vectorizer on X86 we should implement a proper
> cost estimate; a minimal sketch of what that hook might look like follows.
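> 
> A minimal sketch (mine, as it might appear in X86TargetTransformInfo.cpp;
> this is not in-tree code, the signature follows the TTI hook of this era
> and may differ between LLVM versions, and the flat per-member shuffle cost
> of 1 is an assumption for illustration only):
> 
> int X86TTIImpl::getInterleavedMemoryOpCost(unsigned Opcode, Type *VecTy,
>                                            unsigned Factor,
>                                            ArrayRef<unsigned> Indices,
>                                            unsigned Alignment,
>                                            unsigned AddressSpace) {
>   // Cost of the wide access after legalization, e.g. 2 for an <8 x i32>
>   // load split into two 128-bit loads.
>   int Cost = getMemoryOpCost(Opcode, VecTy, Alignment, AddressSpace);
>   // Charge one real shuffle per member actually used (like the pshufd +
>   // pblendw sequence below), instead of BasicTTI's per-element extracts
>   // + inserts.
>   Cost += (int)Indices.size();
>   return Cost;  // 2 + 1 = 3 for the stride-2 <8 x i32> case at VF = 4
> }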
>
>
>
> The test you mentioned is indeed a candidate for strided memory
> vectorization.
>
>
>
> Regards,
>
> Ashutosh
>
>
>
> *From:* Michael Kuperstein [mailto:mkuper at google.com]
> *Sent:* Friday, August 5, 2016 4:53 AM
> *To:* Demikhovsky, Elena <elena.demikhovsky at intel.com>
> *Cc:* Renato Golin <renato.golin at linaro.org>; Sanjay Patel <
> spatel at rotateright.com>; Nema, Ashutosh <Ashutosh.Nema at amd.com>; Matthew
> Simpson <mssimpso at codeaurora.org>; llvm-dev <llvm-dev at lists.llvm.org>
> *Subject:* Re: [llvm-dev] enabling interleaved access loop vectorization
>
>
>
> Hi Elena,
>
>
>
> Circling back to this, do you know of any concrete cases where enabling
> interleaved access on x86 is unprofitable?
>
> Right now, there are some cases where we lose significantly, because (a)
> we consider gathers (on architectures that don't have them) extremely
> expensive, so we won't vectorize them at all without interleaved access,
> and (b) we have interleaved access turned off.
>
>
>
> Consider something like this:
>
>
>
> void foo(int *in, int *out) {
>
> int i = 0;
>
> for (i = 0; i < 256; ++i) {
>
> out[i] = in[i] + in[i + 1] + in[i + 2] + in[i * 2];
>
> }
>
> }
>
>
>
> We don't vectorize this loop at all, because we calculate the cost of the
> in[i * 2] gather to be 14 cycles per lane (!).
>
> This is an overestimate we need to fix, since the vectorized code is
> actually fairly decent - e.g. forcing vectorization, with SSE4.2, we get:
>
>
>
> .LBB0_3: # %vector.body
>
> # =>This Inner Loop Header: Depth=1
>
> movdqu (%rdi,%rax,4), %xmm3
>
> movd %xmm0, %rcx
>
> movdqu 4(%rdi,%rcx,4), %xmm4
>
> paddd %xmm3, %xmm4
>
> movdqu 8(%rdi,%rcx,4), %xmm3
>
> paddd %xmm4, %xmm3
>
> movdqa %xmm1, %xmm4
>
> paddq %xmm4, %xmm4
>
> movdqa %xmm0, %xmm5
>
> paddq %xmm5, %xmm5
>
> movd %xmm5, %rcx
>
> pextrq $1, %xmm5, %rdx
>
> movd %xmm4, %r8
>
> pextrq $1, %xmm4, %r9
>
> movd (%rdi,%rcx,4), %xmm4 # xmm4 = mem[0],zero,zero,zero
>
> pinsrd $1, (%rdi,%rdx,4), %xmm4
>
> pinsrd $2, (%rdi,%r8,4), %xmm4
>
> pinsrd $3, (%rdi,%r9,4), %xmm4
>
> paddd %xmm3, %xmm4
>
> movdqu %xmm4, (%rsi,%rax,4)
>
> addq $4, %rax
>
> paddq %xmm2, %xmm0
>
> paddq %xmm2, %xmm1
>
> cmpq $256, %rax # imm = 0x100
>
> jne .LBB0_3
>
>
>
> But the real point is that with interleaved access enabled, we vectorize,
> and get:
>
>
>
> .LBB0_3: # %vector.body
>
> # =>This Inner Loop Header: Depth=1
>
> movdqu (%rdi,%rcx), %xmm0
>
> movdqu 4(%rdi,%rcx), %xmm1
>
> movdqu 8(%rdi,%rcx), %xmm2
>
> paddd %xmm0, %xmm1
>
> paddd %xmm2, %xmm1
>
> movdqu (%rdi,%rcx,2), %xmm0
>
> movdqu 16(%rdi,%rcx,2), %xmm2
>
> pshufd $132, %xmm2, %xmm2 # xmm2 = xmm2[0,1,0,2]
>
> pshufd $232, %xmm0, %xmm0 # xmm0 = xmm0[0,2,2,3]
>
> pblendw $240, %xmm2, %xmm0 # xmm0 = xmm0[0,1,2,3],xmm2[4,5,6,7]
>
> paddd %xmm1, %xmm0
>
> movdqu %xmm0, (%rsi,%rcx)
>
> cmpq $992, %rcx # imm = 0x3E0
>
> jne .LBB0_7
>
>
>
> The performance I see out of the 3 versions (with a 500K-iteration outer
> loop):
>
>
>
> Scalar: 0m10.320s
>
> Vector (Non-interleaved): 0m8.054s
>
> Vector (Interleaved): 0m3.541s
>
>
>
> This is far from being the perfect use case for interleaved access:
>
> 1) There's no real interleaving, just one strided gather, so this would be
> better served by Ashutosh's full "strided access" proposal (see the sketch
> after this list).
> 
> 2) It looks like the actual move + shuffle sequence is not better, and
> probably even worse, than just inserting directly from memory - but it's
> still worthwhile because of how much we save on the index computations.
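> 
> To make that distinction concrete, a sketch of the two access patterns in C
> (my illustration, not code from the proposal):
> 
> // True interleaving: every lane of the wide load is consumed (factor 2).
> void interleaved(int *in, int *out, int n) {
>   for (int i = 0; i < n; ++i)
>     out[i] = in[2 * i] + in[2 * i + 1];
> }
> 
> // A single strided access: only every other element is used, so half the
> // lanes of a wide load are dead - which is what a dedicated strided-access
> // path could model and lower more cheaply than general interleaving.
> void strided(int *in, int *out, int n) {
>   for (int i = 0; i < n; ++i)
>     out[i] = in[2 * i];  // same shape as in[i * 2] in the loop above
> }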
>
> Regardless of all that, the fact of the matter is that we get much better
> code by treating it as interleaved, and I think this may be a good enough
> motivation to enable it, unless we significantly regress in other cases.
>
>
>
> I was going to look at benchmarks to see if we get any regressions, but if
> you already have examples you're aware of, that would be great.
>
>
>
> Thanks,
>
> Michael
>
>
>
> On Thu, May 26, 2016 at 12:35 PM, Demikhovsky, Elena via llvm-dev <
> llvm-dev at lists.llvm.org> wrote:
>
> Interleaved access is not enabled on X86 yet.
> We looked at this feature and came to the conclusion that interleaving (as
> loads + shuffles) is not always profitable on X86. We should provide the
> right cost, which depends on the number of shuffles; the number of shuffles
> depends on the permutation (shuffle mask). And even if we estimate the
> number of shuffles correctly, the shuffles are not generated in place: the
> vectorizer produces a long chain of "extracts" and "inserts" that will
> hopefully be combined into shuffles by a later instcombine pass.
>
> - Elena
>
>
> >-----Original Message-----
> >From: Renato Golin [mailto:renato.golin at linaro.org]
> >Sent: Thursday, May 26, 2016 21:25
> >To: Sanjay Patel <spatel at rotateright.com>; Demikhovsky, Elena
> ><elena.demikhovsky at intel.com>
> >Cc: llvm-dev <llvm-dev at lists.llvm.org>
> >Subject: Re: [llvm-dev] enabling interleaved access loop vectorization
> >
> >On 26 May 2016 at 19:12, Sanjay Patel via llvm-dev
> ><llvm-dev at lists.llvm.org> wrote:
> >> Is there a compile-time and/or potential runtime cost that makes
> >> enableInterleavedAccessVectorization() default to 'false'?
> >>
> >> I notice that this is set to true for ARM, AArch64, and PPC.
> >>
> >> In particular, I'm wondering if there's a reason it's not enabled for
> >> x86 in relation to PR27881:
> >> https://llvm.org/bugs/show_bug.cgi?id=27881
> >
> >Hi Sanjay,
> >
> >The feature was originally developed for ARM's VLDn/VSTn instructions
> >and then extended to AArch64 and PPC, but not x86/64 yet.
> >
> >I believe Elena was working on that, but needed to get the scatter/gather
> >intrinsics working first. I just copied her in case I'm wrong. :)
> >
> >cheers,
> >--renato