<div dir="ltr">Hi Elena,<div><br></div><div>Circling back to this, do you know of any concrete cases where enabling interleaved access on x86 is unprofitable?</div><div>Right now, there are some cases where we lose significantly, because (a) we consider gathers (on architectures that don't have them) extremely expensive, so we won't vectorize them at all without interleaved access, and (b) we have interleaved access turned off.<br></div><div><br></div><div>Consider something like this:</div><div><div><br></div><div><div>void foo(int *in, int *out) {</div><div> int i = 0;</div><div> for (i = 0; i < 256; ++i) {<br></div><div> out[i] = in[i] + in[i + 1] + in[i + 2] + in[i * 2];</div><div> }</div><div>}</div></div></div><div><br></div><div>We don't vectorize this loop at all, because we calculate the cost of the in[i * 2] gather to be 14 cycles per lane (!).</div><div>This is an overestimate we need to fix, since the vectorized code is actually fairly decent - e.g. forcing vectorization, with SSE4.2, we get:<br></div><div><div><br></div><div><div>.LBB0_3: # %vector.body</div><div> # =>This Inner Loop Header: Depth=1</div><div><span class="gmail-m_-2363683590039465670m_-6696729873126710771gmail-Apple-tab-span" style="white-space:pre-wrap"> </span>movdqu<span class="gmail-m_-2363683590039465670m_-6696729873126710771gmail-Apple-tab-span" style="white-space:pre-wrap"> </span>(%rdi,%rax,4), %xmm3</div><div><span class="gmail-m_-2363683590039465670m_-6696729873126710771gmail-Apple-tab-span" style="white-space:pre-wrap"> </span>movd<span class="gmail-m_-2363683590039465670m_-6696729873126710771gmail-Apple-tab-span" style="white-space:pre-wrap"> </span>%xmm0, %rcx</div><div><span class="gmail-m_-2363683590039465670m_-6696729873126710771gmail-Apple-tab-span" style="white-space:pre-wrap"> </span>movdqu<span class="gmail-m_-2363683590039465670m_-6696729873126710771gmail-Apple-tab-span" style="white-space:pre-wrap"> </span>4(%rdi,%rcx,4), %xmm4</div><div><span 
class="gmail-m_-2363683590039465670m_-6696729873126710771gmail-Apple-tab-span" style="white-space:pre-wrap"> </span>paddd<span class="gmail-m_-2363683590039465670m_-6696729873126710771gmail-Apple-tab-span" style="white-space:pre-wrap"> </span>%xmm3, %xmm4</div><div><span class="gmail-m_-2363683590039465670m_-6696729873126710771gmail-Apple-tab-span" style="white-space:pre-wrap"> </span>movdqu<span class="gmail-m_-2363683590039465670m_-6696729873126710771gmail-Apple-tab-span" style="white-space:pre-wrap"> </span>8(%rdi,%rcx,4), %xmm3</div><div><span class="gmail-m_-2363683590039465670m_-6696729873126710771gmail-Apple-tab-span" style="white-space:pre-wrap"> </span>paddd<span class="gmail-m_-2363683590039465670m_-6696729873126710771gmail-Apple-tab-span" style="white-space:pre-wrap"> </span>%xmm4, %xmm3</div><div><span class="gmail-m_-2363683590039465670m_-6696729873126710771gmail-Apple-tab-span" style="white-space:pre-wrap"> </span>movdqa<span class="gmail-m_-2363683590039465670m_-6696729873126710771gmail-Apple-tab-span" style="white-space:pre-wrap"> </span>%xmm1, %xmm4</div><div><span class="gmail-m_-2363683590039465670m_-6696729873126710771gmail-Apple-tab-span" style="white-space:pre-wrap"> </span>paddq<span class="gmail-m_-2363683590039465670m_-6696729873126710771gmail-Apple-tab-span" style="white-space:pre-wrap"> </span>%xmm4, %xmm4</div><div><span class="gmail-m_-2363683590039465670m_-6696729873126710771gmail-Apple-tab-span" style="white-space:pre-wrap"> </span>movdqa<span class="gmail-m_-2363683590039465670m_-6696729873126710771gmail-Apple-tab-span" style="white-space:pre-wrap"> </span>%xmm0, %xmm5</div><div><span class="gmail-m_-2363683590039465670m_-6696729873126710771gmail-Apple-tab-span" style="white-space:pre-wrap"> </span>paddq<span class="gmail-m_-2363683590039465670m_-6696729873126710771gmail-Apple-tab-span" style="white-space:pre-wrap"> </span>%xmm5, %xmm5</div><div><span class="gmail-m_-2363683590039465670m_-6696729873126710771gmail-Apple-tab-span" 
style="white-space:pre-wrap"> </span>movd<span class="gmail-m_-2363683590039465670m_-6696729873126710771gmail-Apple-tab-span" style="white-space:pre-wrap"> </span>%xmm5, %rcx</div><div><span class="gmail-m_-2363683590039465670m_-6696729873126710771gmail-Apple-tab-span" style="white-space:pre-wrap"> </span>pextrq<span class="gmail-m_-2363683590039465670m_-6696729873126710771gmail-Apple-tab-span" style="white-space:pre-wrap"> </span>$1, %xmm5, %rdx</div><div><span class="gmail-m_-2363683590039465670m_-6696729873126710771gmail-Apple-tab-span" style="white-space:pre-wrap"> </span>movd<span class="gmail-m_-2363683590039465670m_-6696729873126710771gmail-Apple-tab-span" style="white-space:pre-wrap"> </span>%xmm4, %r8</div><div><span class="gmail-m_-2363683590039465670m_-6696729873126710771gmail-Apple-tab-span" style="white-space:pre-wrap"> </span>pextrq<span class="gmail-m_-2363683590039465670m_-6696729873126710771gmail-Apple-tab-span" style="white-space:pre-wrap"> </span>$1, %xmm4, %r9</div><div><span class="gmail-m_-2363683590039465670m_-6696729873126710771gmail-Apple-tab-span" style="white-space:pre-wrap"> </span>movd<span class="gmail-m_-2363683590039465670m_-6696729873126710771gmail-Apple-tab-span" style="white-space:pre-wrap"> </span>(%rdi,%rcx,4), %xmm4 # xmm4 = mem[0],zero,zero,zero</div><div><span class="gmail-m_-2363683590039465670m_-6696729873126710771gmail-Apple-tab-span" style="white-space:pre-wrap"> </span>pinsrd<span class="gmail-m_-2363683590039465670m_-6696729873126710771gmail-Apple-tab-span" style="white-space:pre-wrap"> </span>$1, (%rdi,%rdx,4), %xmm4</div><div><span class="gmail-m_-2363683590039465670m_-6696729873126710771gmail-Apple-tab-span" style="white-space:pre-wrap"> </span>pinsrd<span class="gmail-m_-2363683590039465670m_-6696729873126710771gmail-Apple-tab-span" style="white-space:pre-wrap"> </span>$2, (%rdi,%r8,4), %xmm4</div><div><span class="gmail-m_-2363683590039465670m_-6696729873126710771gmail-Apple-tab-span" style="white-space:pre-wrap"> 
</span>pinsrd<span class="gmail-m_-2363683590039465670m_-6696729873126710771gmail-Apple-tab-span" style="white-space:pre-wrap"> </span>$3, (%rdi,%r9,4), %xmm4</div><div><span class="gmail-m_-2363683590039465670m_-6696729873126710771gmail-Apple-tab-span" style="white-space:pre-wrap"> </span>paddd<span class="gmail-m_-2363683590039465670m_-6696729873126710771gmail-Apple-tab-span" style="white-space:pre-wrap"> </span>%xmm3, %xmm4</div><div><span class="gmail-m_-2363683590039465670m_-6696729873126710771gmail-Apple-tab-span" style="white-space:pre-wrap"> </span>movdqu<span class="gmail-m_-2363683590039465670m_-6696729873126710771gmail-Apple-tab-span" style="white-space:pre-wrap"> </span>%xmm4, (%rsi,%rax,4)</div><div><span class="gmail-m_-2363683590039465670m_-6696729873126710771gmail-Apple-tab-span" style="white-space:pre-wrap"> </span>addq<span class="gmail-m_-2363683590039465670m_-6696729873126710771gmail-Apple-tab-span" style="white-space:pre-wrap"> </span>$4, %rax</div><div><span class="gmail-m_-2363683590039465670m_-6696729873126710771gmail-Apple-tab-span" style="white-space:pre-wrap"> </span>paddq<span class="gmail-m_-2363683590039465670m_-6696729873126710771gmail-Apple-tab-span" style="white-space:pre-wrap"> </span>%xmm2, %xmm0</div><div><span class="gmail-m_-2363683590039465670m_-6696729873126710771gmail-Apple-tab-span" style="white-space:pre-wrap"> </span>paddq<span class="gmail-m_-2363683590039465670m_-6696729873126710771gmail-Apple-tab-span" style="white-space:pre-wrap"> </span>%xmm2, %xmm1</div><div><span class="gmail-m_-2363683590039465670m_-6696729873126710771gmail-Apple-tab-span" style="white-space:pre-wrap"> </span>cmpq<span class="gmail-m_-2363683590039465670m_-6696729873126710771gmail-Apple-tab-span" style="white-space:pre-wrap"> </span>$256, %rax # imm = 0x100</div><div><span class="gmail-m_-2363683590039465670m_-6696729873126710771gmail-Apple-tab-span" style="white-space:pre-wrap"> </span>jne<span 
class="gmail-m_-2363683590039465670m_-6696729873126710771gmail-Apple-tab-span" style="white-space:pre-wrap"> </span>.LBB0_3</div></div></div><div><br></div><div>But the real point is that with interleaved access enabled, we vectorize, and get:</div><div><br></div><div><div>.LBB0_3: # %vector.body</div><div> # =>This Inner Loop Header: Depth=1</div><div><span class="gmail-m_-2363683590039465670m_-6696729873126710771gmail-Apple-tab-span" style="white-space:pre-wrap"> </span>movdqu<span class="gmail-m_-2363683590039465670m_-6696729873126710771gmail-Apple-tab-span" style="white-space:pre-wrap"> </span>(%rdi,%rcx), %xmm0</div><div><span class="gmail-m_-2363683590039465670m_-6696729873126710771gmail-Apple-tab-span" style="white-space:pre-wrap"> </span>movdqu<span class="gmail-m_-2363683590039465670m_-6696729873126710771gmail-Apple-tab-span" style="white-space:pre-wrap"> </span>4(%rdi,%rcx), %xmm1</div><div><span class="gmail-m_-2363683590039465670m_-6696729873126710771gmail-Apple-tab-span" style="white-space:pre-wrap"> </span>movdqu<span class="gmail-m_-2363683590039465670m_-6696729873126710771gmail-Apple-tab-span" style="white-space:pre-wrap"> </span>8(%rdi,%rcx), %xmm2</div><div><span class="gmail-m_-2363683590039465670m_-6696729873126710771gmail-Apple-tab-span" style="white-space:pre-wrap"> </span>paddd<span class="gmail-m_-2363683590039465670m_-6696729873126710771gmail-Apple-tab-span" style="white-space:pre-wrap"> </span>%xmm0, %xmm1</div><div><span class="gmail-m_-2363683590039465670m_-6696729873126710771gmail-Apple-tab-span" style="white-space:pre-wrap"> </span>paddd<span class="gmail-m_-2363683590039465670m_-6696729873126710771gmail-Apple-tab-span" style="white-space:pre-wrap"> </span>%xmm2, %xmm1</div><div><span class="gmail-m_-2363683590039465670m_-6696729873126710771gmail-Apple-tab-span" style="white-space:pre-wrap"> </span>movdqu<span class="gmail-m_-2363683590039465670m_-6696729873126710771gmail-Apple-tab-span" style="white-space:pre-wrap"> 
</span>(%rdi,%rcx,2), %xmm0</div><div><span class="gmail-m_-2363683590039465670m_-6696729873126710771gmail-Apple-tab-span" style="white-space:pre-wrap"> </span>movdqu<span class="gmail-m_-2363683590039465670m_-6696729873126710771gmail-Apple-tab-span" style="white-space:pre-wrap"> </span>16(%rdi,%rcx,2), %xmm2</div><div><span class="gmail-m_-2363683590039465670m_-6696729873126710771gmail-Apple-tab-span" style="white-space:pre-wrap"> </span>pshufd<span class="gmail-m_-2363683590039465670m_-6696729873126710771gmail-Apple-tab-span" style="white-space:pre-wrap"> </span>$132, %xmm2, %xmm2 # xmm2 = xmm2[0,1,0,2]</div><div><span class="gmail-m_-2363683590039465670m_-6696729873126710771gmail-Apple-tab-span" style="white-space:pre-wrap"> </span>pshufd<span class="gmail-m_-2363683590039465670m_-6696729873126710771gmail-Apple-tab-span" style="white-space:pre-wrap"> </span>$232, %xmm0, %xmm0 # xmm0 = xmm0[0,2,2,3]</div><div><span class="gmail-m_-2363683590039465670m_-6696729873126710771gmail-Apple-tab-span" style="white-space:pre-wrap"> </span>pblendw<span class="gmail-m_-2363683590039465670m_-6696729873126710771gmail-Apple-tab-span" style="white-space:pre-wrap"> </span>$240, %xmm2, %xmm0 # xmm0 = xmm0[0,1,2,3],xmm2[4,5,6,7]</div><div><span class="gmail-m_-2363683590039465670m_-6696729873126710771gmail-Apple-tab-span" style="white-space:pre-wrap"> </span>paddd<span class="gmail-m_-2363683590039465670m_-6696729873126710771gmail-Apple-tab-span" style="white-space:pre-wrap"> </span>%xmm1, %xmm0</div><div><span class="gmail-m_-2363683590039465670m_-6696729873126710771gmail-Apple-tab-span" style="white-space:pre-wrap"> </span>movdqu<span class="gmail-m_-2363683590039465670m_-6696729873126710771gmail-Apple-tab-span" style="white-space:pre-wrap"> </span>%xmm0, (%rsi,%rcx)</div><div><span class="gmail-m_-2363683590039465670m_-6696729873126710771gmail-Apple-tab-span" style="white-space:pre-wrap"> </span>cmpq<span 
class="gmail-m_-2363683590039465670m_-6696729873126710771gmail-Apple-tab-span" style="white-space:pre-wrap"> </span>$992, %rcx # imm = 0x3E0</div><div><span class="gmail-m_-2363683590039465670m_-6696729873126710771gmail-Apple-tab-span" style="white-space:pre-wrap"> </span>jne<span class="gmail-m_-2363683590039465670m_-6696729873126710771gmail-Apple-tab-span" style="white-space:pre-wrap"> </span>.LBB0_7</div></div><div><br></div><div>The performance I see from the three versions (with a 500K-iteration outer loop):<br></div><div><br></div><div>Scalar: 0m10.320s</div><div>Vector (Non-interleaved): 0m8.054s</div><div>Vector (Interleaved): 0m3.541s</div><div><br></div><div><div>This is far from the perfect use case for interleaved access:</div><div>1) There's no real interleaving, just one strided gather, so this would be better served by Ashutosh's full "strided access" proposal.</div></div><div>2) It looks like the actual move + shuffle sequence is no better, and probably even worse, than just inserting directly from memory - but it's still worthwhile because of how much we save on the index computations.<br></div><div>Regardless, we get much better code by treating the access as interleaved, and I think that may be good enough motivation to enable it, unless we significantly regress in other cases.<br></div><div><br></div><div>I was going to look at benchmarks to see if we get any regressions, but if you already know of examples, that would be great.<br></div><div><br></div><div>Thanks,</div><div> Michael</div></div><div class="gmail_extra"><br><div class="gmail_quote">On Thu, May 26, 2016 at 12:35 PM, Demikhovsky, Elena via llvm-dev <span dir="ltr"><<a href="mailto:llvm-dev@lists.llvm.org" target="_blank">llvm-dev@lists.llvm.org</a>></span> wrote:<br><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">Interleaved access is not enabled on X86 yet.<br>
We looked at this feature and came to the conclusion that interleaving (as loads + shuffles) is not always profitable on X86. We should provide the right cost, which depends on the number of shuffles. The number of shuffles depends on the permutation (shuffle mask). And even if we estimate the number of shuffles, the shuffles are not generated in place: the vectorizer produces a long sequence of "extracts" and "inserts" that will hopefully be combined into shuffles by a later instcombine pass.<br>
<br>
- Elena<br>
<span class=""><br>
<br>
>-----Original Message-----<br>
>From: Renato Golin [mailto:<a href="mailto:renato.golin@linaro.org">renato.golin@linaro.org</a>]<br>
>Sent: Thursday, May 26, 2016 21:25<br>
>To: Sanjay Patel <<a href="mailto:spatel@rotateright.com">spatel@rotateright.com</a>>; Demikhovsky, Elena<br>
><<a href="mailto:elena.demikhovsky@intel.com">elena.demikhovsky@intel.com</a>><br>
>Cc: llvm-dev <<a href="mailto:llvm-dev@lists.llvm.org">llvm-dev@lists.llvm.org</a>><br>
>Subject: Re: [llvm-dev] enabling interleaved access loop vectorization<br>
><br>
>On 26 May 2016 at 19:12, Sanjay Patel via llvm-dev<br>
>&lt;<a href="mailto:llvm-dev@lists.llvm.org">llvm-dev@lists.llvm.org</a>&gt; wrote:<br>
>> Is there a compile-time and/or potential runtime cost that makes<br>
>> enableInterleavedAccessVectorization() default to 'false'?<br>
>><br>
>> I notice that this is set to true for ARM, AArch64, and PPC.<br>
>><br>
>> In particular, I'm wondering if there's a reason it's not enabled for<br>
>> x86 in relation to PR27881:<br>
>> <a href="https://llvm.org/bugs/show_bug.cgi?id=27881" rel="noreferrer" target="_blank">https://llvm.org/bugs/show_bug.cgi?id=27881</a><br>
><br>
>Hi Sanjay,<br>
><br>
>The feature was originally developed for ARM's VLDn/VSTn instructions<br>
>and then extended to AArch64 and PPC, but not x86/64 yet.<br>
><br>
>I believe Elena was working on that, but needed to get the scatter/gather<br>
>intrinsics working first. I just copied her in case I'm wrong. :)<br>
><br>
>cheers,<br>
>--renato<br>
</span>
<div class="HOEnZb"><div class="h5">
</div></div></blockquote></div><br></div>
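<div><br></div><div>For readers following the in[i * 2] case from the top of this message, here is a rough, portable C model of what the interleaved form does: two contiguous 4-element loads followed by a selection of the even lanes, in place of four separately indexed loads. It is only a sketch, not the vectorizer's actual output. The names foo_scalar, foo_blocked, count_mismatches, LANES and IN_LEN are made up for illustration, and the input buffer is assumed to be padded to 512 elements because the second wide load touches in[2 * i + 7].</div>

```c
#include <assert.h>

#define N 256      /* trip count of the example loop */
#define LANES 4    /* models a 128-bit vector of 32-bit ints */
#define IN_LEN 512 /* assumption: padded so in[2 * (N - LANES) + 7] is readable */

/* Scalar reference: the loop from the message. */
static void foo_scalar(const int *in, int *out) {
  for (int i = 0; i < N; ++i)
    out[i] = in[i] + in[i + 1] + in[i + 2] + in[i * 2];
}

/* Model of the interleaved form, LANES results per outer iteration. */
static void foo_blocked(const int *in, int *out) {
  assert(N % LANES == 0);
  for (int i = 0; i < N; i += LANES) {
    int lo[LANES], hi[LANES], even[LANES];
    /* Two contiguous "wide loads" covering in[2i .. 2i+7]. */
    for (int l = 0; l < LANES; ++l) {
      lo[l] = in[2 * i + l];
      hi[l] = in[2 * i + LANES + l];
    }
    /* The "shuffle": keep the even elements, i.e. in[2 * (i + l)]. */
    even[0] = lo[0];
    even[1] = lo[2];
    even[2] = hi[0];
    even[3] = hi[2];
    /* Unit-stride terms plus the deinterleaved stride-2 term. */
    for (int l = 0; l < LANES; ++l)
      out[i + l] = in[i + l] + in[i + l + 1] + in[i + l + 2] + even[l];
  }
}

/* Runs both versions on the same data; returns how many outputs differ. */
static int count_mismatches(void) {
  static int in[IN_LEN], ref[N], got[N];
  int bad = 0;
  for (int k = 0; k < IN_LEN; ++k)
    in[k] = 3 * k + 1;
  foo_scalar(in, ref);
  foo_blocked(in, got);
  for (int i = 0; i < N; ++i)
    bad += (ref[i] != got[i]);
  return bad;
}
```

<div>Calling count_mismatches() from a small driver should return 0 if the two forms agree. The model says nothing about cost; it only shows that the stride-2 access can be expressed as wide loads plus a fixed lane selection, with no per-lane index computation.</div>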