<html><head><style type='text/css'>p { margin: 0; }</style></head><body><div style='font-family: arial,helvetica,sans-serif; font-size: 10pt; color: #000000'><br><hr id="zwchr"><blockquote style="border-left: 2px solid rgb(16, 16, 255); margin-left: 5px; padding-left: 5px; color: rgb(0, 0, 0); font-weight: normal; font-style: normal; text-decoration: none; font-family: Helvetica,Arial,sans-serif; font-size: 12pt;"><b>From: </b>"Hao Liu" &lt;Hao.Liu@arm.com&gt;<br><b>To: </b>aschwaighofer@apple.com, hfinkel@anl.gov, "Nadav Rotem" &lt;nrotem@apple.com&gt;, "Elena Demikhovsky" &lt;elena.demikhovsky@intel.com&gt;<br><b>Cc: </b>llvm-commits@cs.uiuc.edu, "Jiangning Liu" &lt;Jiangning.Liu@arm.com&gt;, "James Molloy" &lt;James.Molloy@arm.com&gt;<br><b>Sent: </b>Friday, March 20, 2015 6:47:52 AM<br><b>Subject: </b>[RFC][PATCH][LoopVectorize] Teach Loop Vectorizer about interleaved data accesses<br><br>
<style><!--
@font-face
{font-family:SimSun;
panose-1:2 1 6 0 3 1 1 1 1 1;}
@font-face
{font-family:Calibri;
panose-1:2 15 5 2 2 2 4 3 2 4;}
@font-face
{font-family:"\@SimSun";
panose-1:2 1 6 0 3 1 1 1 1 1;}
p.MsoNormal, li.MsoNormal, div.MsoNormal
{margin:0cm;
margin-bottom:.0001pt;
font-size:11.0pt;
font-family:"Calibri","sans-serif";}
a:link, span.MsoHyperlink
{mso-style-priority:99;
color:blue;
text-decoration:underline;}
a:visited, span.MsoHyperlinkFollowed
{mso-style-priority:99;
color:purple;
text-decoration:underline;}
p
{mso-style-priority:99;
mso-margin-top-alt:auto;
margin-right:0cm;
mso-margin-bottom-alt:auto;
margin-left:0cm;
font-size:12.0pt;
font-family:"Times New Roman","serif";}
span.EmailStyle17
{mso-style-type:personal-compose;
font-family:"Calibri","sans-serif";
color:windowtext;}
span.apple-converted-space
{mso-style-name:apple-converted-space;}
.MsoChpDefault
{mso-style-type:export-only;
font-family:"Calibri","sans-serif";}
@page WordSection1
{size:612.0pt 792.0pt;
margin:72.0pt 72.0pt 72.0pt 72.0pt;}
div.WordSection1
{page:WordSection1;}
--></style><div class="WordSection1"><p class="MsoNormal">Hi,</p><p class="MsoNormal"> </p><p class="MsoNormal">The two attached patches work toward this goal:</p><p class="MsoNormal"> LoopVectorize-InterleaveAccess.patch teaches the Loop Vectorizer about interleaved data accesses and generates a target-independent intrinsic for each load/store.</p><p class="MsoNormal"> AArch64Backend-MatchIntrinsics.patch matches several target-independent intrinsics into one AArch64 ldN/stN intrinsic, so that the AArch64 backend can generate ldN/stN instructions.</p><p class="MsoNormal"> </p><p class="MsoNormal">Currently, LoopVectorize can vectorize consecutive accesses well. It can vectorize loops like</p><p class="MsoNormal"> for (int i = 0; i < n; i++)</p><p class="MsoNormal"> sum += R[i];</p><p class="MsoNormal"> </p><p class="MsoNormal">But it doesn't handle strided accesses well. Interleaved access is a special case of strided access. An example of interleaved access:</p><p class="MsoNormal"> for (int i = 0; i < n; i++) {</p><p class="MsoNormal"> int even = A[2*i];</p><p class="MsoNormal"> int odd = A[2*i + 1];</p><p class="MsoNormal"> // do something with odd &amp; even. </p><p class="MsoNormal"> }</p><p id="DWT1492" class="MsoNormal">To vectorize such a case, we need two vectors: one with the even elements, another with the odd elements. To gather the even elements, we need several scalar loads for "A[0], A[2], A[4], ...", and several INSERT_ELEMENTs to combine them. The cost is very high and usually prevents vectorization of such loops. </p></div></blockquote><br>Perhaps this is a silly question, but why do you need interleaved load/store to support this? If we know that we need to access A[0], A[2], A[4], A[6], can't we generate two vector loads, one for A[0...3] and one for A[4...7], and then shuffle the results together? You need to leave the vector loop one iteration early (so you don't access off the end of the original access range), but that does not seem like a big loss. 
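<br><br>To make the idea concrete, here is a rough sketch in plain C. It is not taken from either patch: the names VF, vec_load and deinterleave2 are hypothetical, the 4-lane width is an assumption, and the inner loops only stand in for real vector loads and shuffles. Because this loop uses both the even and the odd elements, the two wide loads cover exactly A[2i] through A[2i+7] and never read past the original access range; if only the even elements were used, the last wide load would touch one extra element, which is why the vector loop would have to exit early.<br>
<pre>
```c
/* Assumed vector width: 4 lanes of int. Hypothetical, for illustration. */
enum { VF = 4 };

/* Stand-in for one wide load of VF consecutive ints. */
static void vec_load(const int *p, int v[VF]) {
    for (int l = 0; l < VF; l++)
        v[l] = p[l];
}

/* Split A[0 .. 2*n-1] into even[0 .. n-1] and odd[0 .. n-1]. */
static void deinterleave2(const int *A, int n, int *even, int *odd) {
    int i = 0;
    /* Vector body: two wide loads, then two shuffles. */
    for (; i + VF <= n; i += VF) {
        int lo[VF], hi[VF];
        vec_load(A + 2 * i, lo);       /* A[2i]   .. A[2i+3] */
        vec_load(A + 2 * i + VF, hi);  /* A[2i+4] .. A[2i+7] */
        for (int l = 0; l < VF; l++) {
            /* Shuffle masks 0,2,4,6 and 1,3,5,7 over concat(lo, hi). */
            int e = 2 * l;
            int o = 2 * l + 1;
            even[i + l] = e < VF ? lo[e] : hi[e - VF];
            odd[i + l]  = o < VF ? lo[o] : hi[o - VF];
        }
    }
    /* Scalar remainder for the last n % VF iterations. */
    for (; i < n; i++) {
        even[i] = A[2 * i];
        odd[i]  = A[2 * i + 1];
    }
}
```
</pre>
On a target with ld2, the two loads plus two shuffles in the vector body could presumably be matched to a single instruction; on other targets they would stay as ordinary loads and shuffles.<br><br>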
If I'm right, then I'd love to see this implemented in a way that can take advantage of interleaved load/store on targets that support them, but does not require target support.<br><br>Thanks again,<br>Hal<br><br><blockquote style="border-left: 2px solid rgb(16, 16, 255); margin-left: 5px; padding-left: 5px; color: rgb(0, 0, 0); font-weight: normal; font-style: normal; text-decoration: none; font-family: Helvetica,Arial,sans-serif; font-size: 12pt;"><div class="WordSection1"><p class="MsoNormal">Some backends, such as AArch64/ARM, support interleaved load/store: ldN/stN (N is 2, 3, or 4). I know X86 can also support similar operations. One ld2 can load two vectors: one with the even elements, the other with the odd elements. So this case can be vectorized into AArch64 instructions:</p><p class="MsoNormal"> LD2 { V0, V1 } [X0]</p><p class="MsoNormal"> // V0 contains even elements. Do something related to even elements with V0.</p><p class="MsoNormal"> // V1 contains odd elements. Do something related to odd elements with V1.</p><p class="MsoNormal"> </p><p class="MsoNormal"> </p><p class="MsoNormal">1. Design</p><p class="MsoNormal">My design follows the Loop Vectorizer's three existing phases.</p><p class="MsoNormal">(1) Legality Phase:</p><p class="MsoNormal"> (a). Collect all constant-stride accesses other than consecutive accesses. </p><p class="MsoNormal"> (b). Group the load/store accesses that have the same stride and base pointer. </p><p class="MsoNormal"> (c). Find the consecutive chains in (b). If the number of accesses in a chain equals the stride, they are interleaved accesses.</p><p class="MsoNormal">For the even/odd example, we find two loads, one for the even and one for the odd elements. Both have stride 2, and together they are consecutive, so they are recorded as interleaved accesses.</p><p class="MsoNormal"> </p><p class="MsoNormal">(2) Profitability Phase:</p><p class="MsoNormal"> Add a target hook to calculate the cost. Currently the cost is 1. 
For now this doesn't affect the result much, so I haven't done much work in this phase.</p><p class="MsoNormal"> </p><p class="MsoNormal">(3) Transform Phase: </p><p class="MsoNormal"> As there is no IR construct for interleaved data accesses, I think we should use intrinsics. The problem is that the relationship is "N to one", i.e. several loads/stores map to one ldN/stN instruction. There are already ldN/stN intrinsics in the AArch64/ARM backends, such as llvm.aarch64.neon.ldN, which looks like "call { <4 x i32>, <4 x i32> } llvm.aarch64.neon.ld2.v4i32()". In the middle end, there are two IR loads. </p><p class="MsoNormal"></p><p class="MsoNormal">We need a way to match two loads to one target-specific intrinsic. I see two options:</p><p class="MsoNormal"> (a). Two steps, split between the middle end and the backend. The first step matches each load/store to one target-independent intrinsic in the loop vectorizer. The second step matches several of those intrinsics into one ldN/stN intrinsic. This is what my attached patches do. For the even/odd example above, the loop vectorizer generates two intrinsics: </p><p class="MsoNormal"> "%even-elements = call <4 x i32> @llvm.interleave.load.v4i32", </p><p class="MsoNormal"> "%odd-elements = call <4 x i32> @llvm.interleave.load.v4i32".</p><p class="MsoNormal">A backend pass then combines them into one intrinsic:</p><p class="MsoNormal"> "%even-odd-elements = call { <4 x i32>, <4 x i32> } @llvm.aarch64.neon.ld2.v4i32"</p><p class="MsoNormal">But I think the backend pass is fragile and difficult to implement. It fails to combine if one load is missing or has been moved to another basic block. I also haven't checked memory dependences yet.</p><p class="MsoNormal"> (b). One step, in the middle end only. We can match several loads/stores into one ldN/stN-like target-independent intrinsic, so the AArch64/ARM backends need only a slight modification: replacing the currently used intrinsic with the new target-independent one. 
This requires introducing a new intrinsic such as "{ <4 x i32>, <4 x i32> } llvm.interleaved.load.v4i32()".</p><p class="MsoNormal"> </p><p class="MsoNormal"> Actually, I prefer solution (b), which is easier to implement and more robust than solution (a). But solution (a) seems more target-independent. What do you think?</p><p class="MsoNormal"> </p><p class="MsoNormal">2. Test</p><p class="MsoNormal">I've tested the attached patches with the LLVM test-suite, SPEC2000, SPEC2006, EEMBC, and Geekbench on an AArch64 machine. They all pass.</p><p class="MsoNormal">But performance is not affected. Some specific benchmarks, such as EEMBC rgbcmy and EEMBC rgbyiq, were expected to speed up several times over. The reason they don't is that other issues prevent the vectorization. Some known issues are:</p><p class="MsoNormal"> (1). Too many unnecessary runtime checks (the interleaved accesses are compared with each other).</p><p class="MsoNormal"> (2). Store-to-load forwarding checks (they don't take strided accesses into account).</p><p class="MsoNormal"> (3). Type promotion (i8 is illegal but <16 x i8> is legal; i8 is promoted to i32, so the extend and truncate operations increase the total cost).</p><p class="MsoNormal"> (4). The vectorization factor is selected according to the widest type (if both i8 and i32 are present, we select a small factor based on i32 rather than i8).</p><p class="MsoNormal">Anyway, we can fix these in the future and get the performance improvements.</p><p class="MsoNormal"> </p><p class="MsoNormal">What are your opinions on the solution? I'm still undecided about the transform phase.</p><p class="MsoNormal"> </p><p class="MsoNormal">Thanks,</p><p class="MsoNormal">-Hao</p></div></blockquote><br>-- <br><div><span name="x"></span>Hal Finkel<br>Assistant Computational Scientist<br>Leadership Computing Facility<br>Argonne National Laboratory<span name="x"></span><br></div></div></body></html>