<html xmlns:v="urn:schemas-microsoft-com:vml" xmlns:o="urn:schemas-microsoft-com:office:office" xmlns:w="urn:schemas-microsoft-com:office:word" xmlns:m="http://schemas.microsoft.com/office/2004/12/omml" xmlns="http://www.w3.org/TR/REC-html40">

<head>

<meta http-equiv="Content-Type" content="text/html; charset=utf-8">

<meta name="Generator" content="Microsoft Word 15 (filtered medium)">

<style><!--

/* Font Definitions */

@font-face

        {font-family:"MS Mincho";

        panose-1:2 2 6 9 4 2 5 8 3 4;}

@font-face

        {font-family:"Cambria Math";

        panose-1:2 4 5 3 5 4 6 3 2 4;}

@font-face

        {font-family:Calibri;

        panose-1:2 15 5 2 2 2 4 3 2 4;}

@font-face

        {font-family:Consolas;

        panose-1:2 11 6 9 2 2 4 3 2 4;}

@font-face

        {font-family:"\@MS Mincho";

        panose-1:2 2 6 9 4 2 5 8 3 4;}

/* Style Definitions */

p.MsoNormal, li.MsoNormal, div.MsoNormal

        {margin:0in;

        margin-bottom:.0001pt;

        font-size:12.0pt;

        font-family:"Times New Roman",serif;}

a:link, span.MsoHyperlink

        {mso-style-priority:99;

        color:blue;

        text-decoration:underline;}

a:visited, span.MsoHyperlinkFollowed

        {mso-style-priority:99;

        color:purple;

        text-decoration:underline;}

pre

        {mso-style-priority:99;

        mso-style-link:"HTML Preformatted Char";

        margin:0in;

        margin-bottom:.0001pt;

        font-size:10.0pt;

        font-family:"Courier New";}

p.MsoListParagraph, li.MsoListParagraph, div.MsoListParagraph

        {mso-style-priority:34;

        margin-top:0in;

        margin-right:0in;

        margin-bottom:0in;

        margin-left:.5in;

        margin-bottom:.0001pt;

        font-size:12.0pt;

        font-family:"Times New Roman",serif;}

span.HTMLPreformattedChar

        {mso-style-name:"HTML Preformatted Char";

        mso-style-priority:99;

        mso-style-link:"HTML Preformatted";

        font-family:Consolas;}

span.EmailStyle19

        {mso-style-type:personal-reply;

        font-family:"Calibri",sans-serif;

        color:#1F497D;}

.MsoChpDefault

        {mso-style-type:export-only;

        font-family:"Calibri",sans-serif;}

@page WordSection1

        {size:8.5in 11.0in;

        margin:1.0in 1.0in 1.0in 1.0in;}

div.WordSection1

        {page:WordSection1;}

/* List Definitions */

@list l0

        {mso-list-id:1688824935;

        mso-list-type:hybrid;

        mso-list-template-ids:51669138 67698705 67698713 67698715 67698703 67698713 67698715 67698703 67698713 67698715;}

@list l0:level1

        {mso-level-text:"%1\)";

        mso-level-tab-stop:none;

        mso-level-number-position:left;

        text-indent:-.25in;}

@list l0:level2

        {mso-level-number-format:alpha-lower;

        mso-level-tab-stop:none;

        mso-level-number-position:left;

        text-indent:-.25in;}

@list l0:level3

        {mso-level-number-format:roman-lower;

        mso-level-tab-stop:none;

        mso-level-number-position:right;

        text-indent:-9.0pt;}

@list l0:level4

        {mso-level-tab-stop:none;

        mso-level-number-position:left;

        text-indent:-.25in;}

@list l0:level5

        {mso-level-number-format:alpha-lower;

        mso-level-tab-stop:none;

        mso-level-number-position:left;

        text-indent:-.25in;}

@list l0:level6

        {mso-level-number-format:roman-lower;

        mso-level-tab-stop:none;

        mso-level-number-position:right;

        text-indent:-9.0pt;}

@list l0:level7

        {mso-level-tab-stop:none;

        mso-level-number-position:left;

        text-indent:-.25in;}

@list l0:level8

        {mso-level-number-format:alpha-lower;

        mso-level-tab-stop:none;

        mso-level-number-position:left;

        text-indent:-.25in;}

@list l0:level9

        {mso-level-number-format:roman-lower;

        mso-level-tab-stop:none;

        mso-level-number-position:right;

        text-indent:-9.0pt;}

ol

        {margin-bottom:0in;}

ul

        {margin-bottom:0in;}

--></style><!--[if gte mso 9]><xml>

<o:shapedefaults v:ext="edit" spidmax="1026" />

</xml><![endif]--><!--[if gte mso 9]><xml>

<o:shapelayout v:ext="edit">

<o:idmap v:ext="edit" data="1" />

</o:shapelayout></xml><![endif]-->

</head>

<body lang="EN-US" link="blue" vlink="purple">

<div class="WordSection1">

<p class="MsoNormal"><span style="font-size:11.0pt;font-family:"Calibri",sans-serif;color:#1F497D"><o:p> </o:p></span></p>

<p class="MsoNormal"><span style="font-size:11.0pt;font-family:"Calibri",sans-serif;color:#1F497D">My perspective, being a vectorizer guy, is that vectorizer should<o:p></o:p></span></p>

<p class="MsoListParagraph" style="text-indent:-.25in;mso-list:l0 level1 lfo1"><![if !supportLists]><span style="font-size:11.0pt;font-family:"Calibri",sans-serif;color:#1F497D"><span style="mso-list:Ignore">1)<span style="font:7.0pt "Times New Roman"">     

</span></span></span><![endif]><span style="font-size:11.0pt;font-family:"Calibri",sans-serif;color:#1F497D">Take this optimization into account in the cost modeling so that it favors full vector compute.<o:p></o:p></span></p>

<p class="MsoListParagraph" style="text-indent:-.25in;mso-list:l0 level1 lfo1"><![if !supportLists]><span style="font-size:11.0pt;font-family:"Calibri",sans-serif;color:#1F497D"><span style="mso-list:Ignore">2)<span style="font:7.0pt "Times New Roman"">     

</span></span></span><![endif]><span style="font-size:11.0pt;font-family:"Calibri",sans-serif;color:#1F497D">but generate plain widened computation:<br>

full vector unit stride load of A[],<br>

full vector unit stride load of B[],<br>

sign extend both, (this makes it 2x full vector, on the surface)<br>

multiply<br>

add<br>

…<br>

standard reduction last value sequence after the loop<o:p></o:p></span></p>

<p class="MsoNormal"><span style="font-size:11.0pt;font-family:"Calibri",sans-serif;color:#1F497D">and let downstream optimizer, possibly in Target, use instructions like (v)pmaddwd effectively.<o:p></o:p></span></p>

<p class="MsoNormal"><span style="font-size:11.0pt;font-family:"Calibri",sans-serif;color:#1F497D">If needed, an IR-to-IR xform before hitting Target.<o:p></o:p></span></p>

<p class="MsoNormal"><span style="font-size:11.0pt;font-family:"Calibri",sans-serif;color:#1F497D"><o:p> </o:p></span></p>

<p class="MsoNormal"><span style="font-size:11.0pt;font-family:"Calibri",sans-serif;color:#1F497D">This mechanism also works if a programmer or other FE produces a similar naïvely vectorized IR like above.<o:p></o:p></span></p>

<p class="MsoNormal"><span style="font-size:11.0pt;font-family:"Calibri",sans-serif;color:#1F497D">Since vectorizer should understand the existence of the optimization, it can certainly be arm-twisted to<o:p></o:p></span></p>

<p class="MsoNormal"><span style="font-size:11.0pt;font-family:"Calibri",sans-serif;color:#1F497D">generate the IR desired by the Target. However, whether we want to do that is a totally different story.<o:p></o:p></span></p>

<p class="MsoNormal"><span style="font-size:11.0pt;font-family:"Calibri",sans-serif;color:#1F497D"><o:p> </o:p></span></p>

<p class="MsoNormal"><span style="font-size:11.0pt;font-family:"Calibri",sans-serif;color:#1F497D">Vectorizer should focus on having reasonable cost model and generating straight-forward optimizable IR<o:p></o:p></span></p>

<p class="MsoNormal"><span style="font-size:11.0pt;font-family:"Calibri",sans-serif;color:#1F497D">---- as opposed to generating convoluted IR (such as breaking up unit-stride load into even/odd, simply<o:p></o:p></span></p>

<p class="MsoNormal"><span style="font-size:11.0pt;font-family:"Calibri",sans-serif;color:#1F497D">to put them back to unit-stride again) wanted by the Target.<o:p></o:p></span></p>

<p class="MsoNormal"><span style="font-size:11.0pt;font-family:"Calibri",sans-serif;color:#1F497D"><o:p> </o:p></span></p>

<p class="MsoNormal"><span style="font-size:11.0pt;font-family:"Calibri",sans-serif;color:#1F497D">My recommendation is first analyzing the source of the current code generation deficiencies and then<o:p></o:p></span></p>

<p class="MsoNormal"><span style="font-size:11.0pt;font-family:"Calibri",sans-serif;color:#1F497D">try to remedy it there.<o:p></o:p></span></p>

<p class="MsoNormal"><span style="font-size:11.0pt;font-family:"Calibri",sans-serif;color:#1F497D"><o:p> </o:p></span></p>

<p class="MsoNormal"><span style="font-size:11.0pt;font-family:"Calibri",sans-serif;color:#1F497D">Thanks,<o:p></o:p></span></p>

<p class="MsoNormal"><span style="font-size:11.0pt;font-family:"Calibri",sans-serif;color:#1F497D">Hideki<o:p></o:p></span></p>

<p class="MsoNormal"><a name="_MailEndCompose"><span style="font-size:11.0pt;font-family:"Calibri",sans-serif;color:#1F497D"><o:p> </o:p></span></a></p>

<p class="MsoNormal"><a name="_____replyseparator"></a><b><span style="font-size:11.0pt;font-family:"Calibri",sans-serif">From:</span></b><span style="font-size:11.0pt;font-family:"Calibri",sans-serif"> Craig Topper [mailto:craig.topper@gmail.com]

<br>

<b>Sent:</b> Monday, July 23, 2018 4:37 PM<br>

<b>To:</b> Hal Finkel <hfinkel@anl.gov><br>

<b>Cc:</b> Saito, Hideki <hideki.saito@intel.com>; estotzer@ti.com; Nemanja Ivanovic <nemanja.i.ibm@gmail.com>; Adam Nemet <anemet@apple.com>; graham.hunter@arm.com; Michael Kuperstein <mkuper@google.com>; Sanjay Patel <spatel@rotateright.com>; Simon Pilgrim

 <llvm-dev@redking.me.uk>; ashutosh.nema@amd.com; llvm-dev <llvm-dev@lists.llvm.org><br>

<b>Subject:</b> Re: [LoopVectorizer] Improving the performance of dot product reduction loop<o:p></o:p></span></p>

<p class="MsoNormal"><o:p> </o:p></p>

<div>

<p class="MsoNormal"><br clear="all">

<o:p></o:p></p>

<div>

<div>

<p class="MsoNormal">~Craig<o:p></o:p></p>

</div>

</div>

<p class="MsoNormal" style="margin-bottom:12.0pt"><o:p> </o:p></p>

<div>

<div>

<p class="MsoNormal">On Mon, Jul 23, 2018 at 4:24 PM Hal Finkel <<a href="mailto:hfinkel@anl.gov" target="_blank">hfinkel@anl.gov</a>> wrote:<o:p></o:p></p>

</div>

<blockquote style="border:none;border-left:solid #CCCCCC 1.0pt;padding:0in 0in 0in 6.0pt;margin-left:4.8pt;margin-right:0in">

<div>

<p class="MsoNormal"><o:p> </o:p></p>

<div>

<p class="MsoNormal">On 07/23/2018 05:22 PM, Craig Topper wrote:<o:p></o:p></p>

</div>

<blockquote style="margin-top:5.0pt;margin-bottom:5.0pt">

<div>

<div>

<p class="MsoNormal">Hello all,<o:p></o:p></p>

</div>

<div>

<p class="MsoNormal"><o:p> </o:p></p>

</div>

<div>

<p class="MsoNormal">This code <a href="https://godbolt.org/g/tTyxpf" target="_blank">

https://godbolt.org/g/tTyxpf</a> is a dot product reduction loop multipying sign extended 16-bit values to produce a 32-bit accumulated result. The x86 backend is currently not able to optimize it as well as gcc and icc. The IR we are getting from the loop

 vectorizer has several v8i32 adds and muls inside the loop. These are fed by v8i16 loads and sexts from v8i16 to v8i32. The x86 backend recognizes that these are addition reductions of multiplication so we use the vpmaddwd instruction which calculates 32-bit

 products from 16-bit inputs and does a horizontal add of adjacent pairs. A vpmaddwd given two v8i16 inputs will produce a v4i32 result.<o:p></o:p></p>

</div>

</div>

</blockquote>

</div>

</blockquote>

<div>

<p class="MsoNormal"><o:p> </o:p></p>

</div>

<div>

<p class="MsoNormal">That godbolt link seems wrong. It wasn't supposed to be clang IR. This should be right.<o:p></o:p></p>

</div>

<div>

<p class="MsoNormal"> <o:p></o:p></p>

</div>

<blockquote style="border:none;border-left:solid #CCCCCC 1.0pt;padding:0in 0in 0in 6.0pt;margin-left:4.8pt;margin-right:0in">

<div>

<blockquote style="margin-top:5.0pt;margin-bottom:5.0pt">

<div>

<div>

<p class="MsoNormal"><o:p> </o:p></p>

</div>

<div>

<p class="MsoNormal">In the example code, because we are reducing the number of elements from 8->4 in the vpmaddwd step we are left with a width mismatch between vpmaddwd and the vpaddd instruction that we use to sum with the results from the previous loop

 iterations. We rely on the fact that a 128-bit vpmaddwd zeros the upper bits of the register so that we can use a 256-bit vpaddd instruction so that the upper elements can keep going around the loop without being disturbed in case they weren't initialized

 to 0. But this still means the vpmaddwd instruction is doing half the amount of work the CPU is capable of if we had been able to use a 256-bit vpmaddwd instruction. Additionally, future x86 CPUs will be gaining an instruction that can do VPMADDWD and VPADDD

 in one instruction, but that width mismatch makes that instruction difficult to utilize.<o:p></o:p></p>

</div>

<div>

<p class="MsoNormal"><o:p> </o:p></p>

</div>

<div>

<p class="MsoNormal">In order for the backend to handle this better it would be great if we could have something like two v32i8 loads, two shufflevectors to extract the even elements and the odd elements to create four v16i8 pieces.<o:p></o:p></p>

</div>

</div>

</blockquote>

<p class="MsoNormal"><br>

Why v*i8 loads? I thought that we have 16-bit and 32-bit types here?<o:p></o:p></p>

</div>

</blockquote>

<div>

<p class="MsoNormal"><o:p> </o:p></p>

</div>

<div>

<p class="MsoNormal">Oops that should have been v16i16. Mixed up my 256-bit types.<o:p></o:p></p>

</div>

<div>

<p class="MsoNormal"> <o:p></o:p></p>

</div>

<blockquote style="border:none;border-left:solid #CCCCCC 1.0pt;padding:0in 0in 0in 6.0pt;margin-left:4.8pt;margin-right:0in">

<div>

<p class="MsoNormal"><br>

<br>

<o:p></o:p></p>

<blockquote style="margin-top:5.0pt;margin-bottom:5.0pt">

<div>

<div>

<p class="MsoNormal">Sign extend each of those pieces. Multiply the two even pieces and the two odd pieces separately, sum those results with a v8i32 add. Then another v8i32 add to accumulate the previous loop iterations. Then ensures that no pieces exceed

 the target vector width and the final operation is correctly sized to go around the loop in one register. All but the last add can then be pattern matched to vpmaddwd as proposed in <a href="https://reviews.llvm.org/D49636" target="_blank">https://reviews.llvm.org/D49636</a>.

 And for the future CPU the whole thing can be matched to the new instruction.<o:p></o:p></p>

</div>

<div>

<p class="MsoNormal"><o:p> </o:p></p>

</div>

<div>

<p class="MsoNormal">Do other targets have a similar instruction or a similar issue to this? Is this something we can solve in the loop vectorizer? Or should we have a separate IR transformation that can recognize this pattern and generate the new sequence?

 As a separate pass we would need to pair two vector loads together, remove a reduction step outside the loop and remove half the phis assuming the loop was partially unrolled. Or if there was only one add/mul inside the loop we'd have to reduce its width and

 the width of the phi.<o:p></o:p></p>

</div>

</div>

</blockquote>

<p class="MsoNormal"><br>

Can you explain how the desired code from the vectorizer differs from the code that the vectorizer produces if you add '#pragma clang loop vectorize(enable) vectorize_width(16)'  above the loop? I tried it in your godbolt example and the generated code looks

 very similar to the icc-generated code.<o:p></o:p></p>

</div>

</blockquote>

<div>

<p class="MsoNormal"><o:p> </o:p></p>

</div>

<div>

<p class="MsoNormal">It's similar, but the vpxor %xmm0, %xmm0, %xmm0 is being unnecessarily carried across the loop. It's then redundantly added twice in the reduction after the loop despite it being 0. This happens because we basically tricked the backend

 into generating a 256-bit vpmaddwd concated with a 256-bit zero vector going into a 512-bit vaddd before type legalization. The 512-bit concat and vpaddd get split during type legalization, and the high half of the add gets constant folded away. I'm guessing

 we probably finished with 4 vpxors before the loop but MachineCSE(or some other pass?) combined two of them when it figured out the loop didn't modify them.<o:p></o:p></p>

</div>

<div>

<p class="MsoNormal"> <o:p></o:p></p>

</div>

<blockquote style="border:none;border-left:solid #CCCCCC 1.0pt;padding:0in 0in 0in 6.0pt;margin-left:4.8pt;margin-right:0in">

<div>

<p class="MsoNormal"><br>

Thanks again,<br>

Hal<br>

<br>

<br>

<o:p></o:p></p>

<blockquote style="margin-top:5.0pt;margin-bottom:5.0pt">

<div>

<div>

<p class="MsoNormal"><o:p> </o:p></p>

</div>

<p class="MsoNormal">Thanks,<br clear="all">

<o:p></o:p></p>

<div>

<div>

<p class="MsoNormal">~Craig<o:p></o:p></p>

</div>

</div>

</div>

</blockquote>

<p class="MsoNormal"><br>

<br>

<o:p></o:p></p>

<pre>-- <o:p></o:p></pre>

<pre>Hal Finkel<o:p></o:p></pre>

<pre>Lead, Compiler Technology and Programming Languages<o:p></o:p></pre>

<pre>Leadership Computing Facility<o:p></o:p></pre>

<pre>Argonne National Laboratory<o:p></o:p></pre>

</div>

</blockquote>

</div>

</div>

</div>

</body>

</html>