<html>
  <head>
    <meta http-equiv="Content-Type" content="text/html; charset=utf-8">
  </head>
  <body text="#000000" bgcolor="#FFFFFF">
    <div class="moz-cite-prefix">On 07/24/2018 12:07 PM, Craig Topper
      wrote:<br>
    </div>
    <blockquote type="cite"
cite="mid:CAF7ks-P7fzhig6N8UWzFPyMDn8X5wRfvFLaUPjNxXrFjgVEweQ@mail.gmail.com">
      <div dir="ltr">With maximize-bandwidth I'd still end up with the
        extra vpxors above the loop and the extra addition reduction
        steps at the end that we get from forcing the vf to 16 right?</div>
    </blockquote>
    <br>
    Yeah, I think we'd need something else to help with that. My
    underlying thought here is that, based on your description, we
    really do want a VF of 16 (because we want to use 256-bit loads,
    etc.). So, when thinking about how to fix things, we should
    take the VF = 16 output, not the VF = 8 output, as the
    starting point for the backend.<br>
    <br>
     -Hal<br>
    <br>
    <blockquote type="cite"
cite="mid:CAF7ks-P7fzhig6N8UWzFPyMDn8X5wRfvFLaUPjNxXrFjgVEweQ@mail.gmail.com">
      <div dir="ltr">
        <div><br clear="all">
          <div>
            <div dir="ltr" class="gmail_signature"
              data-smartmail="gmail_signature">~Craig</div>
          </div>
          <br>
        </div>
      </div>
      <br>
      <div class="gmail_quote">
        <div dir="ltr">On Tue, Jul 24, 2018 at 10:04 AM Craig Topper
          <<a href="mailto:craig.topper@gmail.com"
            moz-do-not-send="true">craig.topper@gmail.com</a>> wrote:<br>
        </div>
        <blockquote class="gmail_quote" style="margin:0 0 0
          .8ex;border-left:1px #ccc solid;padding-left:1ex">
          <div dir="ltr"><br>
            <br>
            <div class="gmail_quote">
              <div dir="ltr">On Tue, Jul 24, 2018 at 6:10 AM Hal Finkel
                <<a href="mailto:hfinkel@anl.gov" target="_blank"
                  moz-do-not-send="true">hfinkel@anl.gov</a>> wrote:<br>
              </div>
              <blockquote class="gmail_quote" style="margin:0px 0px 0px
                0.8ex;border-left:1px solid
                rgb(204,204,204);padding-left:1ex">
                <div bgcolor="#FFFFFF"> <br>
                  <div
                    class="m_-1066914453783556908gmail-m_5656040821941643324moz-cite-prefix">On
                    07/23/2018 06:37 PM, Craig Topper wrote:<br>
                  </div>
                  <blockquote type="cite">
                    <div dir="ltr"><br clear="all">
                      <div>
                        <div dir="ltr"
class="m_-1066914453783556908gmail-m_5656040821941643324m_-4456769814224348104gmail_signature">~Craig</div>
                      </div>
                      <br>
                      <br>
                      <div class="gmail_quote">
                        <div dir="ltr">On Mon, Jul 23, 2018 at 4:24 PM
                          Hal Finkel <<a
                            href="mailto:hfinkel@anl.gov"
                            target="_blank" moz-do-not-send="true">hfinkel@anl.gov</a>>
                          wrote:<br>
                        </div>
                        <blockquote class="gmail_quote"
                          style="margin:0px 0px 0px
                          0.8ex;border-left:1px solid
                          rgb(204,204,204);padding-left:1ex">
                          <div bgcolor="#FFFFFF"> <br>
                            <div
class="m_-1066914453783556908gmail-m_5656040821941643324m_-4456769814224348104m_7042871795306411262moz-cite-prefix">On
                              07/23/2018 05:22 PM, Craig Topper wrote:<br>
                            </div>
                            <blockquote type="cite">
                              <div dir="ltr">
                                <div>Hello all,</div>
                                <div><br>
                                </div>
                                <div>This code <a
                                    href="https://godbolt.org/g/tTyxpf"
                                    target="_blank"
                                    moz-do-not-send="true">https://godbolt.org/g/tTyxpf</a> is
                                  a dot product reduction loop
                                  multiplying sign-extended 16-bit values
                                  to produce a 32-bit accumulated
                                  result. The x86 backend is currently
                                  not able to optimize it as well as gcc
                                  and icc. The IR we are getting from
                                  the loop vectorizer has several v8i32
                                  adds and muls inside the loop. These
                                  are fed by v8i16 loads and sexts from
                                  v8i16 to v8i32. The x86 backend
                                  recognizes that these are addition
                                  reductions of multiplication so we use
                                  the vpmaddwd instruction which
                                  calculates 32-bit products from 16-bit
                                  inputs and does a horizontal add of
                                  adjacent pairs. A vpmaddwd given two
                                  v8i16 inputs will produce a v4i32
                                  result.</div>
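                                <div><br>
                                </div>
                                <div>As a rough scalar sketch of the
                                  description above (this is not the code
                                  behind the godbolt link; the names dot16
                                  and pmaddwd_model are made up for
                                  illustration), the reduction and the
                                  pairing done by a 128-bit vpmaddwd could
                                  be modeled like this:</div>
                                <pre>
```c
/* Scalar reference for the reduction: sign-extend the 16-bit inputs,
   multiply, and accumulate into a 32-bit sum. */
int dot16(const short *a, const short *b, int n) {
    int sum = 0;
    for (int i = 0; i < n; i++)
        sum += (int)a[i] * (int)b[i];
    return sum;
}

/* Model of a 128-bit vpmaddwd: eight 16-bit inputs per operand, four
   32-bit outputs. Each output is the sum of two adjacent products.
   Assumes the sum fits in a 32-bit int; the instruction itself wraps
   in the single case where all four inputs are -32768. */
void pmaddwd_model(const short a[8], const short b[8], int out[4]) {
    for (int i = 0; i < 4; i++)
        out[i] = (int)a[2 * i] * (int)b[2 * i]
               + (int)a[2 * i + 1] * (int)b[2 * i + 1];
}
```
</pre>
                                <div>Summing the four outputs of
                                  pmaddwd_model gives the same value as
                                  dot16 over the same eight elements, which
                                  is why the backend can use the
                                  instruction for this reduction.</div>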
                              </div>
                            </blockquote>
                          </div>
                        </blockquote>
                        <div><br>
                        </div>
                        <div>That godbolt link seems wrong. It wasn't
                          supposed to be clang IR. This should be right.</div>
                        <div> </div>
                        <blockquote class="gmail_quote"
                          style="margin:0px 0px 0px
                          0.8ex;border-left:1px solid
                          rgb(204,204,204);padding-left:1ex">
                          <div bgcolor="#FFFFFF">
                            <blockquote type="cite">
                              <div dir="ltr">
                                <div><br>
                                </div>
                                <div>In the example code, because we are
                                  reducing the number of elements from
                                  8->4 in the vpmaddwd step we are
                                  left with a width mismatch between
                                  vpmaddwd and the vpaddd instruction
                                  that we use to sum with the results
                                  from the previous loop iterations. We
                                  rely on the fact that a 128-bit
                                  vpmaddwd zeros the upper bits of the
                                  register, which lets us use a 256-bit
                                  vpaddd instruction: the upper
                                  elements keep going around the
                                  loop undisturbed even if
                                  they weren't initialized to 0. But
                                  this still means the vpmaddwd
                                  instruction is doing half the
                                  work the CPU could do if we
                                  were able to use a 256-bit
                                  vpmaddwd instruction. Additionally,
                                  future x86 CPUs will be gaining an
                                  instruction that can do VPMADDWD and
                                  VPADDD in one instruction, but that
                                  width mismatch makes that instruction
                                  difficult to utilize.</div>
                                <div><br>
                                </div>
                                <div>In order for the backend to handle
                                  this better it would be great if we
                                  could have something like two v32i8
                                  loads, two shufflevectors to extract
                                  the even elements and the odd elements
                                  to create four v16i8 pieces.</div>
                              </div>
                            </blockquote>
                            <br>
                            Why v*i8 loads? I thought that we have
                            16-bit and 32-bit types here?<br>
                          </div>
                        </blockquote>
                        <div><br>
                        </div>
                        <div>Oops that should have been v16i16. Mixed up
                          my 256-bit types.</div>
                        <div> </div>
                        <blockquote class="gmail_quote"
                          style="margin:0px 0px 0px
                          0.8ex;border-left:1px solid
                          rgb(204,204,204);padding-left:1ex">
                          <div bgcolor="#FFFFFF"> <br>
                            <blockquote type="cite">
                              <div dir="ltr">
                                <div>Sign extend each of those pieces.
                                  Multiply the two even pieces and the
                                  two odd pieces separately, sum those
                                  results with a v8i32 add. Then another
                                  v8i32 add to accumulate the previous
                                  loop iterations.</div>
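                                <div><br>
                                </div>
                                <div>A minimal scalar model of that
                                  sequence, assuming each array argument
                                  stands in for one v16i16 load and acc for
                                  the v8i32 accumulator phi (step_v16 and
                                  dot16_blocked are hypothetical names, and
                                  n is assumed to be a multiple of 16):</div>
                                <pre>
```c
/* One vector iteration: split into even and odd lanes, sign-extend,
   multiply each pair, add the two products, then accumulate. */
void step_v16(const short a[16], const short b[16], int acc[8]) {
    for (int i = 0; i < 8; i++) {
        int even = (int)a[2 * i] * (int)b[2 * i];
        int odd  = (int)a[2 * i + 1] * (int)b[2 * i + 1];
        acc[i] += even + odd; /* first add pairs, second add accumulates */
    }
}

/* Whole reduction; n must be a multiple of 16 in this sketch. */
int dot16_blocked(const short *a, const short *b, int n) {
    int acc[8] = {0};
    for (int i = 0; i < n; i += 16)
        step_v16(a + i, b + i, acc);
    int sum = 0;
    for (int i = 0; i < 8; i++) /* horizontal reduction after the loop */
        sum += acc[i];
    return sum;
}
```
</pre>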
                              </div>
                            </blockquote>
                          </div>
                        </blockquote>
                      </div>
                    </div>
                  </blockquote>
                  <br>
                  I'm still missing something. Why do you want to
                  separate out the even and odd parts instead of just
                  adding up the first half of the numbers and the second
                  half?<br>
                </div>
              </blockquote>
              <div><br>
              </div>
              <div>Doing even/odd matches up with a pattern I already
                have to support for the code in <a
                  href="https://reviews.llvm.org/D49636" target="_blank"
                  moz-do-not-send="true">https://reviews.llvm.org/D49636</a>.
                I wouldn't even need to detect it as a reduction to do
                the reassociation since even/odd exactly matches the
                behavior of the instruction. But you're right we could
                also just detect the reduction and add two halves.</div>
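              <div><br>
              </div>
              <div>To make the two options concrete, here is a small
                sketch (pair_even_odd, pair_halves and hsum8 are
                hypothetical helper names) that pairs the 16 products
                either by adjacent even/odd lanes or by first half and
                second half. The partial sums differ per lane, but the
                final reduction is the same as long as nothing
                overflows:</div>
              <pre>
```c
/* Pair adjacent lanes: matches what vpmaddwd computes. */
void pair_even_odd(const short a[16], const short b[16], int out[8]) {
    for (int i = 0; i < 8; i++)
        out[i] = (int)a[2 * i] * (int)b[2 * i]
               + (int)a[2 * i + 1] * (int)b[2 * i + 1];
}

/* Pair lane i of the first half with lane i of the second half. */
void pair_halves(const short a[16], const short b[16], int out[8]) {
    for (int i = 0; i < 8; i++)
        out[i] = (int)a[i] * (int)b[i]
               + (int)a[i + 8] * (int)b[i + 8];
}

/* Horizontal sum of the eight partial results. */
int hsum8(const int v[8]) {
    int sum = 0;
    for (int i = 0; i < 8; i++)
        sum += v[i];
    return sum;
}
```
</pre>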
              <div><br>
              </div>
              <div> </div>
              <blockquote class="gmail_quote" style="margin:0px 0px 0px
                0.8ex;border-left:1px solid
                rgb(204,204,204);padding-left:1ex">
                <div bgcolor="#FFFFFF"> <br>
                  Thanks again,<br>
                  Hal<br>
                  <br>
                  <blockquote type="cite">
                    <div dir="ltr">
                      <div class="gmail_quote">
                        <blockquote class="gmail_quote"
                          style="margin:0px 0px 0px
                          0.8ex;border-left:1px solid
                          rgb(204,204,204);padding-left:1ex">
                          <div bgcolor="#FFFFFF">
                            <blockquote type="cite">
                              <div dir="ltr">
                                <div> This ensures that no pieces exceed
                                  the target vector width and the final
                                  operation is correctly sized to go
                                  around the loop in one register. All
                                  but the last add can then be pattern
                                  matched to vpmaddwd as proposed in <a
href="https://reviews.llvm.org/D49636" target="_blank"
                                    moz-do-not-send="true">https://reviews.llvm.org/D49636</a>.
                                  And for the future CPU the whole thing
                                  can be matched to the new instruction.<br>
                                </div>
                                <div><br>
                                </div>
                                <div>Do other targets have a similar
                                  instruction or a similar issue to
                                  this? Is this something we can solve
                                  in the loop vectorizer? Or should we
                                  have a separate IR transformation that
                                  can recognize this pattern and
                                  generate the new sequence? As a
                                  separate pass we would need to pair
                                  two vector loads together, remove a
                                  reduction step outside the loop and
                                  remove half the phis assuming the loop
                                  was partially unrolled. Or if there
                                  was only one add/mul inside the loop
                                  we'd have to reduce its width and the
                                  width of the phi.</div>
                              </div>
                            </blockquote>
                            <br>
                            Can you explain how the desired code from
                            the vectorizer differs from the code that
                            the vectorizer produces if you add '#pragma
                            clang loop vectorize(enable)
                            vectorize_width(16)'  above the loop? I
                            tried it in your godbolt example and the
                            generated code looks very similar to the
                            icc-generated code.<br>
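                            <br>
                            For concreteness, a sketch of where the
                            pragma goes, assuming a plain scalar loop
                            like the one described in this thread (the
                            function name dot16_vf16 is hypothetical and
                            this is not the exact godbolt source):<br>
                            <pre>
```c
/* The pragma applies to the loop that immediately follows it.
   Compilers other than clang may ignore it or warn about it. */
int dot16_vf16(const short *a, const short *b, int n) {
    int sum = 0;
#pragma clang loop vectorize(enable) vectorize_width(16)
    for (int i = 0; i < n; i++)
        sum += (int)a[i] * (int)b[i];
    return sum;
}
```
</pre>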
                          </div>
                        </blockquote>
                        <div><br>
                        </div>
                        <div>It's similar, but the vpxor %xmm0, %xmm0,
                          %xmm0 is being unnecessarily carried across
                          the loop. It's then redundantly added twice in
                          the reduction after the loop despite it being
                          0. This happens because we basically tricked
                          the backend into generating a 256-bit vpmaddwd
                          concatenated with a 256-bit zero vector going into
                          a 512-bit vpaddd before type legalization. The
                          512-bit concat and vpaddd get split during
                          type legalization, and the high half of the
                          add gets constant folded away. I'm guessing we
                          probably finished with 4 vpxors before the
                          loop but MachineCSE (or some other pass?)
                          combined two of them when it figured out the
                          loop didn't modify them.</div>
                        <div> </div>
                        <blockquote class="gmail_quote"
                          style="margin:0px 0px 0px
                          0.8ex;border-left:1px solid
                          rgb(204,204,204);padding-left:1ex">
                          <div bgcolor="#FFFFFF"> <br>
                            Thanks again,<br>
                            Hal<br>
                            <br>
                            <blockquote type="cite">
                              <div dir="ltr">
                                <div><br>
                                </div>
                                Thanks,<br clear="all">
                                <div>
                                  <div dir="ltr"
class="m_-1066914453783556908gmail-m_5656040821941643324m_-4456769814224348104m_7042871795306411262gmail-m_-1264842386137689721gmail_signature">~Craig</div>
                                </div>
                              </div>
                            </blockquote>
                            <br>
                            <pre class="m_-1066914453783556908gmail-m_5656040821941643324m_-4456769814224348104m_7042871795306411262moz-signature" cols="72">-- 
Hal Finkel
Lead, Compiler Technology and Programming Languages
Leadership Computing Facility
Argonne National Laboratory</pre>
                          </div>
                        </blockquote>
                      </div>
                    </div>
                  </blockquote>
                  <br>
                </div>
              </blockquote>
            </div>
          </div>
        </blockquote>
      </div>
    </blockquote>
    <br>
    <pre class="moz-signature" cols="72">-- 
Hal Finkel
Lead, Compiler Technology and Programming Languages
Leadership Computing Facility
Argonne National Laboratory</pre>
  </body>
</html>