<html>
  <head>
    <meta http-equiv="Content-Type" content="text/html; charset=utf-8">
  </head>
  <body text="#000000" bgcolor="#FFFFFF">
    <br>
    <div class="moz-cite-prefix">On 07/23/2018 08:25 PM, Saito, Hideki
      wrote:<br>
    </div>
    <blockquote type="cite"
cite="mid:899F03F2C73A55449C51631866B88749619D4C7C@FMSMSX109.amr.corp.intel.com">
      <meta http-equiv="Content-Type" content="text/html; charset=utf-8">
      <meta name="Generator" content="Microsoft Word 15 (filtered
        medium)">
      <style><!--
/* Font Definitions */
@font-face
        {font-family:"MS Mincho";
        panose-1:2 2 6 9 4 2 5 8 3 4;}
@font-face
        {font-family:"Cambria Math";
        panose-1:2 4 5 3 5 4 6 3 2 4;}
@font-face
        {font-family:Calibri;
        panose-1:2 15 5 2 2 2 4 3 2 4;}
@font-face
        {font-family:Consolas;
        panose-1:2 11 6 9 2 2 4 3 2 4;}
@font-face
        {font-family:"\@MS Mincho";
        panose-1:2 2 6 9 4 2 5 8 3 4;}
/* Style Definitions */
p.MsoNormal, li.MsoNormal, div.MsoNormal
        {margin:0in;
        margin-bottom:.0001pt;
        font-size:12.0pt;
        font-family:"Times New Roman",serif;}
a:link, span.MsoHyperlink
        {mso-style-priority:99;
        color:blue;
        text-decoration:underline;}
a:visited, span.MsoHyperlinkFollowed
        {mso-style-priority:99;
        color:purple;
        text-decoration:underline;}
pre
        {mso-style-priority:99;
        mso-style-link:"HTML Preformatted Char";
        margin:0in;
        margin-bottom:.0001pt;
        font-size:10.0pt;
        font-family:"Courier New";}
p.MsoListParagraph, li.MsoListParagraph, div.MsoListParagraph
        {mso-style-priority:34;
        margin-top:0in;
        margin-right:0in;
        margin-bottom:0in;
        margin-left:.5in;
        margin-bottom:.0001pt;
        font-size:12.0pt;
        font-family:"Times New Roman",serif;}
span.HTMLPreformattedChar
        {mso-style-name:"HTML Preformatted Char";
        mso-style-priority:99;
        mso-style-link:"HTML Preformatted";
        font-family:Consolas;}
span.EmailStyle19
        {mso-style-type:personal-reply;
        font-family:"Calibri",sans-serif;
        color:#1F497D;}
.MsoChpDefault
        {mso-style-type:export-only;
        font-family:"Calibri",sans-serif;}
@page WordSection1
        {size:8.5in 11.0in;
        margin:1.0in 1.0in 1.0in 1.0in;}
div.WordSection1
        {page:WordSection1;}
/* List Definitions */
@list l0
        {mso-list-id:1688824935;
        mso-list-type:hybrid;
        mso-list-template-ids:51669138 67698705 67698713 67698715 67698703 67698713 67698715 67698703 67698713 67698715;}
@list l0:level1
        {mso-level-text:"%1\)";
        mso-level-tab-stop:none;
        mso-level-number-position:left;
        text-indent:-.25in;}
@list l0:level2
        {mso-level-number-format:alpha-lower;
        mso-level-tab-stop:none;
        mso-level-number-position:left;
        text-indent:-.25in;}
@list l0:level3
        {mso-level-number-format:roman-lower;
        mso-level-tab-stop:none;
        mso-level-number-position:right;
        text-indent:-9.0pt;}
@list l0:level4
        {mso-level-tab-stop:none;
        mso-level-number-position:left;
        text-indent:-.25in;}
@list l0:level5
        {mso-level-number-format:alpha-lower;
        mso-level-tab-stop:none;
        mso-level-number-position:left;
        text-indent:-.25in;}
@list l0:level6
        {mso-level-number-format:roman-lower;
        mso-level-tab-stop:none;
        mso-level-number-position:right;
        text-indent:-9.0pt;}
@list l0:level7
        {mso-level-tab-stop:none;
        mso-level-number-position:left;
        text-indent:-.25in;}
@list l0:level8
        {mso-level-number-format:alpha-lower;
        mso-level-tab-stop:none;
        mso-level-number-position:left;
        text-indent:-.25in;}
@list l0:level9
        {mso-level-number-format:roman-lower;
        mso-level-tab-stop:none;
        mso-level-number-position:right;
        text-indent:-9.0pt;}
ol
        {margin-bottom:0in;}
ul
        {margin-bottom:0in;}
--></style>
      <div class="WordSection1">
        <p class="MsoNormal"><span
style="font-size:11.0pt;font-family:"Calibri",sans-serif;color:#1F497D"><o:p> </o:p></span></p>
        <p class="MsoNormal"><span
style="font-size:11.0pt;font-family:"Calibri",sans-serif;color:#1F497D">My
perspective, being a vectorizer guy, is that the vectorizer
            should<o:p></o:p></span></p>
        <p class="MsoListParagraph"
          style="text-indent:-.25in;mso-list:l0 level1 lfo1"><!--[if !supportLists]--><span
style="font-size:11.0pt;font-family:"Calibri",sans-serif;color:#1F497D"><span
              style="mso-list:Ignore">1)<span style="font:7.0pt
                "Times New Roman"">     
              </span></span></span><!--[endif]--><span
style="font-size:11.0pt;font-family:"Calibri",sans-serif;color:#1F497D">Take
            this optimization into account in the cost modeling so that
            it favors full vector compute.<o:p></o:p></span></p>
        <p class="MsoListParagraph"
          style="text-indent:-.25in;mso-list:l0 level1 lfo1"><!--[if !supportLists]--><span
style="font-size:11.0pt;font-family:"Calibri",sans-serif;color:#1F497D"><span
              style="mso-list:Ignore">2)<span style="font:7.0pt
                "Times New Roman"">     
              </span></span></span><!--[endif]--><span
style="font-size:11.0pt;font-family:"Calibri",sans-serif;color:#1F497D">but
            generate plain widened computation:<br>
            full vector unit stride load of A[],<br>
            full vector unit stride load of B[],<br>
            sign extend both, (this makes it 2x full vector, on the
            surface)<br>
            multiply<br>
            add<br>
            …<br>
            standard reduction last value sequence after the loop<o:p></o:p></span></p>
        <p class="MsoNormal"><span
style="font-size:11.0pt;font-family:"Calibri",sans-serif;color:#1F497D">and
let a downstream optimizer, possibly in the Target, use
            instructions like (v)pmaddwd effectively.<o:p></o:p></span></p>
        <p class="MsoNormal"><span
style="font-size:11.0pt;font-family:"Calibri",sans-serif;color:#1F497D">If
            needed, an IR-to-IR xform before hitting Target.<o:p></o:p></span></p>
        <p class="MsoNormal"><span
style="font-size:11.0pt;font-family:"Calibri",sans-serif;color:#1F497D"><o:p> </o:p></span></p>
        <p class="MsoNormal"><span
style="font-size:11.0pt;font-family:"Calibri",sans-serif;color:#1F497D">This
            mechanism also works if a programmer or other FE produces a
similar naïvely vectorized IR like the above.</span></p>
      </div>
    </blockquote>
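For reference, the loop shape under discussion (per the godbolt example in the quoted thread) is a plain dot-product reduction accumulating sign-extended 16-bit products into a 32-bit sum. A minimal C sketch (the function name is illustrative, not from the original source):

```c
#include <stdint.h>

/* Dot-product reduction: sign-extend 16-bit inputs, multiply to
   32 bits, and accumulate. This is the loop shape the vectorizer
   widens, and the pattern the x86 backend wants to map to vpmaddwd. */
int32_t dot(const int16_t *a, const int16_t *b, int n) {
    int32_t sum = 0;
    for (int i = 0; i < n; ++i)
        sum += (int32_t)a[i] * (int32_t)b[i];
    return sum;
}
```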
    <br>
    I think that this should, in general, be our strategy. We might
    have different reduction strategies (we already do, at least in
    terms of the final reduction tree), and one of them might include
    what x86 wants here, so long as we can reasonably create a
    cost-modeling interface that lets us differentiate it from other
    strategies at the IR level. Lacking the ability to abstract this
    behind a generalized strategy with an IR-level cost-modeling
    interface, I think that the vectorizer should produce
    straightforward IR (e.g., what we currently produce with VF=16;
    see the other discussion of the vectorizer-maximize-bandwidth
    option) and the target can then adjust it as necessary to take
    advantage of special isel opportunities.<br>
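    For concreteness, here is a scalar model of the pairwise
    multiply-and-horizontal-add that vpmaddwd performs per 32-bit
    lane, which is also what the even/odd decomposition proposed in
    the quoted discussion computes (a sketch; the helper name is made
    up for illustration):

```c
#include <stdint.h>

/* Scalar model of a 128-bit vpmaddwd: each of the four 32-bit output
   lanes is the sum of the products of one adjacent pair of
   sign-extended 16-bit elements from each input. */
void pmaddwd_model(const int16_t a[8], const int16_t b[8],
                   int32_t out[4]) {
    for (int i = 0; i < 4; ++i)
        out[i] = (int32_t)a[2 * i] * b[2 * i]
               + (int32_t)a[2 * i + 1] * b[2 * i + 1];
}
```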
    <br>
    Thanks again,<br>
    Hal<br>
    <br>
    <blockquote type="cite"
cite="mid:899F03F2C73A55449C51631866B88749619D4C7C@FMSMSX109.amr.corp.intel.com">
      <div class="WordSection1">
        <p class="MsoNormal"><span
style="font-size:11.0pt;font-family:"Calibri",sans-serif;color:#1F497D"><o:p></o:p></span></p>
        <p class="MsoNormal"><span
style="font-size:11.0pt;font-family:"Calibri",sans-serif;color:#1F497D">Since
the vectorizer should understand the existence of the
            optimization, it can certainly be arm-twisted to<o:p></o:p></span></p>
        <p class="MsoNormal"><span
style="font-size:11.0pt;font-family:"Calibri",sans-serif;color:#1F497D">generate
            the IR desired by the Target. However, whether we want to do
            that is a totally different story.<o:p></o:p></span></p>
        <p class="MsoNormal"><span
style="font-size:11.0pt;font-family:"Calibri",sans-serif;color:#1F497D"><o:p> </o:p></span></p>
        <p class="MsoNormal"><span
style="font-size:11.0pt;font-family:&quot;Calibri&quot;,sans-serif;color:#1F497D">The
            vectorizer should focus on having a reasonable cost model
            and generating straightforward, optimizable IR<o:p></o:p></span></p>
        <p class="MsoNormal"><span
style="font-size:11.0pt;font-family:"Calibri",sans-serif;color:#1F497D">----
            as opposed to generating convoluted IR (such as breaking up
a unit-stride load into even/odd parts, simply<o:p></o:p></span></p>
        <p class="MsoNormal"><span
style="font-size:11.0pt;font-family:"Calibri",sans-serif;color:#1F497D">to
            put them back to unit-stride again) wanted by the Target.<o:p></o:p></span></p>
        <p class="MsoNormal"><span
style="font-size:11.0pt;font-family:"Calibri",sans-serif;color:#1F497D"><o:p> </o:p></span></p>
        <p class="MsoNormal"><span
style="font-size:11.0pt;font-family:"Calibri",sans-serif;color:#1F497D">My
            recommendation is first analyzing the source of the current
            code generation deficiencies and then<o:p></o:p></span></p>
        <p class="MsoNormal"><span
style="font-size:11.0pt;font-family:&quot;Calibri&quot;,sans-serif;color:#1F497D">trying
            to remedy them there.<o:p></o:p></span></p>
        <p class="MsoNormal"><span
style="font-size:11.0pt;font-family:"Calibri",sans-serif;color:#1F497D"><o:p> </o:p></span></p>
        <p class="MsoNormal"><span
style="font-size:11.0pt;font-family:"Calibri",sans-serif;color:#1F497D">Thanks,<o:p></o:p></span></p>
        <p class="MsoNormal"><span
style="font-size:11.0pt;font-family:"Calibri",sans-serif;color:#1F497D">Hideki<o:p></o:p></span></p>
        <p class="MsoNormal"><a name="_MailEndCompose"
            moz-do-not-send="true"><span
style="font-size:11.0pt;font-family:"Calibri",sans-serif;color:#1F497D"><o:p> </o:p></span></a></p>
        <p class="MsoNormal"><a name="_____replyseparator"
            moz-do-not-send="true"></a><b><span
              style="font-size:11.0pt;font-family:"Calibri",sans-serif">From:</span></b><span
style="font-size:11.0pt;font-family:"Calibri",sans-serif">
            Craig Topper [<a class="moz-txt-link-freetext" href="mailto:craig.topper@gmail.com">mailto:craig.topper@gmail.com</a>]
            <br>
            <b>Sent:</b> Monday, July 23, 2018 4:37 PM<br>
            <b>To:</b> Hal Finkel <a class="moz-txt-link-rfc2396E" href="mailto:hfinkel@anl.gov"><hfinkel@anl.gov></a><br>
            <b>Cc:</b> Saito, Hideki <a class="moz-txt-link-rfc2396E" href="mailto:hideki.saito@intel.com"><hideki.saito@intel.com></a>;
            <a class="moz-txt-link-abbreviated" href="mailto:estotzer@ti.com">estotzer@ti.com</a>; Nemanja Ivanovic
            <a class="moz-txt-link-rfc2396E" href="mailto:nemanja.i.ibm@gmail.com"><nemanja.i.ibm@gmail.com></a>; Adam Nemet
            <a class="moz-txt-link-rfc2396E" href="mailto:anemet@apple.com"><anemet@apple.com></a>; <a class="moz-txt-link-abbreviated" href="mailto:graham.hunter@arm.com">graham.hunter@arm.com</a>; Michael
            Kuperstein <a class="moz-txt-link-rfc2396E" href="mailto:mkuper@google.com"><mkuper@google.com></a>; Sanjay Patel
            <a class="moz-txt-link-rfc2396E" href="mailto:spatel@rotateright.com"><spatel@rotateright.com></a>; Simon Pilgrim
            <a class="moz-txt-link-rfc2396E" href="mailto:llvm-dev@redking.me.uk"><llvm-dev@redking.me.uk></a>; <a class="moz-txt-link-abbreviated" href="mailto:ashutosh.nema@amd.com">ashutosh.nema@amd.com</a>;
            llvm-dev <a class="moz-txt-link-rfc2396E" href="mailto:llvm-dev@lists.llvm.org"><llvm-dev@lists.llvm.org></a><br>
            <b>Subject:</b> Re: [LoopVectorizer] Improving the
            performance of dot product reduction loop<o:p></o:p></span></p>
        <p class="MsoNormal"><o:p> </o:p></p>
        <div>
          <p class="MsoNormal"><br clear="all">
            <o:p></o:p></p>
          <div>
            <div>
              <p class="MsoNormal">~Craig<o:p></o:p></p>
            </div>
          </div>
          <p class="MsoNormal" style="margin-bottom:12.0pt"><o:p> </o:p></p>
          <div>
            <div>
              <p class="MsoNormal">On Mon, Jul 23, 2018 at 4:24 PM Hal
                Finkel <<a href="mailto:hfinkel@anl.gov"
                  target="_blank" moz-do-not-send="true">hfinkel@anl.gov</a>>
                wrote:<o:p></o:p></p>
            </div>
            <blockquote style="border:none;border-left:solid #CCCCCC
              1.0pt;padding:0in 0in 0in
              6.0pt;margin-left:4.8pt;margin-right:0in">
              <div>
                <p class="MsoNormal"><o:p> </o:p></p>
                <div>
                  <p class="MsoNormal">On 07/23/2018 05:22 PM, Craig
                    Topper wrote:<o:p></o:p></p>
                </div>
                <blockquote style="margin-top:5.0pt;margin-bottom:5.0pt">
                  <div>
                    <div>
                      <p class="MsoNormal">Hello all,<o:p></o:p></p>
                    </div>
                    <div>
                      <p class="MsoNormal"><o:p> </o:p></p>
                    </div>
                    <div>
                      <p class="MsoNormal">This code <a
                          href="https://godbolt.org/g/tTyxpf"
                          target="_blank" moz-do-not-send="true">
                          https://godbolt.org/g/tTyxpf</a> is a dot
                        product reduction loop multiplying sign-extended
                        16-bit values to produce a 32-bit accumulated
                        result. The x86 backend is currently not able to
                        optimize it as well as gcc and icc. The IR we
                        are getting from the loop vectorizer has several
                        v8i32 adds and muls inside the loop. These are
                        fed by v8i16 loads and sexts from v8i16 to
                        v8i32. The x86 backend recognizes that these are
                        addition reductions of multiplication so we use
                        the vpmaddwd instruction which calculates 32-bit
                        products from 16-bit inputs and does a
                        horizontal add of adjacent pairs. A vpmaddwd
                        given two v8i16 inputs will produce a v4i32
                        result.<o:p></o:p></p>
                    </div>
                  </div>
                </blockquote>
              </div>
            </blockquote>
            <div>
              <p class="MsoNormal"><o:p> </o:p></p>
            </div>
            <div>
              <p class="MsoNormal">That godbolt link seems wrong. It
                wasn't supposed to be clang IR. This should be right.<o:p></o:p></p>
            </div>
            <div>
              <p class="MsoNormal"> <o:p></o:p></p>
            </div>
            <blockquote style="border:none;border-left:solid #CCCCCC
              1.0pt;padding:0in 0in 0in
              6.0pt;margin-left:4.8pt;margin-right:0in">
              <div>
                <blockquote style="margin-top:5.0pt;margin-bottom:5.0pt">
                  <div>
                    <div>
                      <p class="MsoNormal"><o:p> </o:p></p>
                    </div>
                    <div>
                      <p class="MsoNormal">In the example code, because
                        we are reducing the number of elements from
                        8->4 in the vpmaddwd step we are left with a
                        width mismatch between vpmaddwd and the vpaddd
                        instruction that we use to sum with the results
                        from the previous loop iterations. We rely on
                        the fact that a 128-bit vpmaddwd zeros the upper
                        bits of the register so that we can use a
                        256-bit vpaddd instruction; the upper
                        elements can then keep going around the loop without
                        being disturbed in case they weren't initialized
                        to 0. But this still means the vpmaddwd
                        instruction is doing half the amount of work the
                        CPU is capable of if we had been able to use a
                        256-bit vpmaddwd instruction. Additionally,
                        future x86 CPUs will be gaining an instruction
                        that can do VPMADDWD and VPADDD in one
                        instruction, but that width mismatch makes that
                        instruction difficult to utilize.<o:p></o:p></p>
                    </div>
                    <div>
                      <p class="MsoNormal"><o:p> </o:p></p>
                    </div>
                    <div>
                      <p class="MsoNormal">In order for the backend to
                        handle this better it would be great if we could
                        have something like two v32i8 loads, two
                        shufflevectors to extract the even elements and
                        the odd elements to create four v16i8 pieces.<o:p></o:p></p>
                    </div>
                  </div>
                </blockquote>
                <p class="MsoNormal"><br>
                  Why v*i8 loads? I thought we had 16-bit and
                  32-bit types here?<o:p></o:p></p>
              </div>
            </blockquote>
            <div>
              <p class="MsoNormal"><o:p> </o:p></p>
            </div>
            <div>
              <p class="MsoNormal">Oops, that should have been v16i16.
                Mixed up my 256-bit types.<o:p></o:p></p>
            </div>
            <div>
              <p class="MsoNormal"> <o:p></o:p></p>
            </div>
            <blockquote style="border:none;border-left:solid #CCCCCC
              1.0pt;padding:0in 0in 0in
              6.0pt;margin-left:4.8pt;margin-right:0in">
              <div>
                <p class="MsoNormal"><br>
                  <br>
                  <o:p></o:p></p>
                <blockquote style="margin-top:5.0pt;margin-bottom:5.0pt">
                  <div>
                    <div>
                      <p class="MsoNormal">Sign extend each of those
                        pieces. Multiply the two even pieces and the two
                        odd pieces separately, sum those results with a
                        v8i32 add. Then another v8i32 add to accumulate
                        the previous loop iterations. This ensures that
                        no pieces exceed the target vector width and the
                        final operation is correctly sized to go around
                        the loop in one register. All but the last add
                        can then be pattern matched to vpmaddwd as
                        proposed in <a
                          href="https://reviews.llvm.org/D49636"
                          target="_blank" moz-do-not-send="true">https://reviews.llvm.org/D49636</a>.
                        And for the future CPU the whole thing can be
                        matched to the new instruction.<o:p></o:p></p>
                    </div>
                    <div>
                      <p class="MsoNormal"><o:p> </o:p></p>
                    </div>
                    <div>
                      <p class="MsoNormal">Do other targets have a
                        similar instruction or a similar issue to this?
                        Is this something we can solve in the loop
                        vectorizer? Or should we have a separate IR
                        transformation that can recognize this pattern
                        and generate the new sequence? As a separate
                        pass we would need to pair two vector loads
                        together, remove a reduction step outside the
                        loop and remove half the phis assuming the loop
                        was partially unrolled. Or if there was only one
                        add/mul inside the loop we'd have to reduce its
                        width and the width of the phi.<o:p></o:p></p>
                    </div>
                  </div>
                </blockquote>
                <p class="MsoNormal"><br>
                  Can you explain how the desired code from the
                  vectorizer differs from the code that the vectorizer
                  produces if you add '#pragma clang loop
                  vectorize(enable) vectorize_width(16)' above the
                  loop? I tried it in your godbolt example and the
                  generated code looks very similar to the icc-generated
                  code.<o:p></o:p></p>
              </div>
            </blockquote>
            <div>
              <p class="MsoNormal"><o:p> </o:p></p>
            </div>
            <div>
              <p class="MsoNormal">It's similar, but the vpxor %xmm0,
                %xmm0, %xmm0 is being unnecessarily carried across the
                loop. It's then redundantly added twice in the reduction
                after the loop despite it being 0. This happens because
                we basically tricked the backend into generating a
                256-bit vpmaddwd concatenated with a 256-bit zero vector
                going into a 512-bit vpaddd before type legalization. The
                512-bit concat and vpaddd get split during type
                legalization, and the high half of the add gets constant
                folded away. I'm guessing we probably finished with 4
                vpxors before the loop, but MachineCSE (or some other
                pass?) combined two of them when it figured out the loop
                didn't modify them.<o:p></o:p></p>
            </div>
            <div>
              <p class="MsoNormal"> <o:p></o:p></p>
            </div>
            <blockquote style="border:none;border-left:solid #CCCCCC
              1.0pt;padding:0in 0in 0in
              6.0pt;margin-left:4.8pt;margin-right:0in">
              <div>
                <p class="MsoNormal"><br>
                  Thanks again,<br>
                  Hal<br>
                  <br>
                  <br>
                  <o:p></o:p></p>
                <blockquote style="margin-top:5.0pt;margin-bottom:5.0pt">
                  <div>
                    <div>
                      <p class="MsoNormal"><o:p> </o:p></p>
                    </div>
                    <p class="MsoNormal">Thanks,<br clear="all">
                      <o:p></o:p></p>
                    <div>
                      <div>
                        <p class="MsoNormal">~Craig<o:p></o:p></p>
                      </div>
                    </div>
                  </div>
                </blockquote>
                <p class="MsoNormal"><br>
                  <br>
                  <o:p></o:p></p>
                <pre>-- <o:p></o:p></pre>
                <pre>Hal Finkel<o:p></o:p></pre>
                <pre>Lead, Compiler Technology and Programming Languages<o:p></o:p></pre>
                <pre>Leadership Computing Facility<o:p></o:p></pre>
                <pre>Argonne National Laboratory<o:p></o:p></pre>
              </div>
            </blockquote>
          </div>
        </div>
      </div>
    </blockquote>
    <br>
    <pre class="moz-signature" cols="72">-- 
Hal Finkel
Lead, Compiler Technology and Programming Languages
Leadership Computing Facility
Argonne National Laboratory</pre>
  </body>
</html>