<html>

  <head>

    <meta http-equiv="Content-Type" content="text/html; charset=utf-8">

  </head>

  <body text="#000000" bgcolor="#FFFFFF">


    <div class="moz-cite-prefix">On 07/24/2018 12:07 PM, Craig Topper

      wrote:<br>

    </div>

    <blockquote type="cite"

cite="mid:CAF7ks-P7fzhig6N8UWzFPyMDn8X5wRfvFLaUPjNxXrFjgVEweQ@mail.gmail.com">

      <meta http-equiv="Content-Type" content="text/html; charset=utf-8">

      <div dir="ltr">With maximize-bandwidth, I'd still end up with the
        extra vpxors above the loop and the extra addition reduction
        steps at the end that we get from forcing the VF to 16, right?</div>

    </blockquote>

    <br>

    Yeah, I think we'd need something else to help with that. My

    underlying thought here is that, based on your description, we

    really do want a VF of 16 (because we want to use 256-bit loads,

    etc.). And so, when thinking about how to fix things, we should

    start with looking at the VF = 16 output, and not the VF = 8 output,

    as the right starting point for the backend.<br>

    <br>

     -Hal<br>

    <br>

    <blockquote type="cite"

cite="mid:CAF7ks-P7fzhig6N8UWzFPyMDn8X5wRfvFLaUPjNxXrFjgVEweQ@mail.gmail.com">

      <div dir="ltr">

        <div><br clear="all">

          <div>

            <div dir="ltr" class="gmail_signature"

              data-smartmail="gmail_signature">~Craig</div>

          </div>

          <br>

        </div>

      </div>

      <br>

      <div class="gmail_quote">

        <div dir="ltr">On Tue, Jul 24, 2018 at 10:04 AM Craig Topper

          &lt;<a href="mailto:craig.topper@gmail.com"
            moz-do-not-send="true">craig.topper@gmail.com</a>&gt; wrote:<br>

        </div>

        <blockquote class="gmail_quote" style="margin:0 0 0

          .8ex;border-left:1px #ccc solid;padding-left:1ex">

          <div dir="ltr"><br>

            <br>

            <div class="gmail_quote">

              <div dir="ltr">On Tue, Jul 24, 2018 at 6:10 AM Hal Finkel

                &lt;<a href="mailto:hfinkel@anl.gov" target="_blank"
                  moz-do-not-send="true">hfinkel@anl.gov</a>&gt; wrote:<br>

              </div>

              <blockquote class="gmail_quote" style="margin:0px 0px 0px

                0.8ex;border-left:1px solid

                rgb(204,204,204);padding-left:1ex">

                <div bgcolor="#FFFFFF"> <br>

                  <div

                    class="m_-1066914453783556908gmail-m_5656040821941643324moz-cite-prefix">On

                    07/23/2018 06:37 PM, Craig Topper wrote:<br>

                  </div>

                  <blockquote type="cite">

                    <div dir="ltr"><br clear="all">

                      <div>

                        <div dir="ltr"

class="m_-1066914453783556908gmail-m_5656040821941643324m_-4456769814224348104gmail_signature">~Craig</div>

                      </div>

                      <br>

                      <br>

                      <div class="gmail_quote">

                        <div dir="ltr">On Mon, Jul 23, 2018 at 4:24 PM

                          Hal Finkel &lt;<a
                            href="mailto:hfinkel@anl.gov"
                            target="_blank" moz-do-not-send="true">hfinkel@anl.gov</a>&gt;

                          wrote:<br>

                        </div>

                        <blockquote class="gmail_quote"

                          style="margin:0px 0px 0px

                          0.8ex;border-left:1px solid

                          rgb(204,204,204);padding-left:1ex">

                          <div bgcolor="#FFFFFF"> <br>

                            <div

class="m_-1066914453783556908gmail-m_5656040821941643324m_-4456769814224348104m_7042871795306411262moz-cite-prefix">On

                              07/23/2018 05:22 PM, Craig Topper wrote:<br>

                            </div>

                            <blockquote type="cite">

                              <div dir="ltr">

                                <div>Hello all,</div>

                                <div><br>

                                </div>

                                <div>This code <a

                                    href="https://godbolt.org/g/tTyxpf"

                                    target="_blank"

                                    moz-do-not-send="true">https://godbolt.org/g/tTyxpf</a> is

                                  a dot product reduction loop

                                  multiplying sign-extended 16-bit values

                                  to produce a 32-bit accumulated

                                  result. The x86 backend is currently

                                  not able to optimize it as well as gcc

                                  and icc. The IR we are getting from

                                  the loop vectorizer has several v8i32

                                  adds and muls inside the loop. These

                                  are fed by v8i16 loads and sexts from

                                  v8i16 to v8i32. The x86 backend

                                  recognizes that these are addition

                                  reductions of multiplication so we use

                                  the vpmaddwd instruction which

                                  calculates 32-bit products from 16-bit

                                  inputs and does a horizontal add of

                                  adjacent pairs. A vpmaddwd given two

                                  v8i16 inputs will produce a v4i32

                                  result.</div>
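                                <div><br>
                                </div>
                                <div>A minimal scalar sketch of the
                                  vpmaddwd behavior described above, for
                                  the 128-bit form only. The function
                                  name is illustrative (not an intrinsic
                                  or LLVM API), and the sketch assumes
                                  short is 16 bits and int is 32 bits.</div>
<pre>
```c
/* Scalar model of a 128-bit vpmaddwd (illustrative name, not a real API).
   Assumes short is 16 bits and int is 32 bits. */
void pmaddwd_128_model(const short a[8], const short b[8], int out[4]) {
    for (int i = 0; i != 4; ++i) {
        /* Sign-extend each 16-bit input and multiply at 32-bit width. */
        int even = (int)a[2 * i] * (int)b[2 * i];
        int odd = (int)a[2 * i + 1] * (int)b[2 * i + 1];
        /* Horizontal add of the adjacent pair; unsigned math models wraparound. */
        out[i] = (int)((unsigned)even + (unsigned)odd);
    }
}
```
</pre>
                                <div>Eight 16-bit inputs per operand
                                  yield four 32-bit outputs, which is
                                  the halving of the element count that
                                  the rest of this message is concerned
                                  with.</div>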

                              </div>

                            </blockquote>

                          </div>

                        </blockquote>

                        <div><br>

                        </div>

                        <div>That godbolt link seems wrong. It wasn't

                          supposed to be clang IR. This should be right.</div>

                        <div> </div>

                        <blockquote class="gmail_quote"

                          style="margin:0px 0px 0px

                          0.8ex;border-left:1px solid

                          rgb(204,204,204);padding-left:1ex">

                          <div bgcolor="#FFFFFF">

                            <blockquote type="cite">

                              <div dir="ltr">

                                <div><br>

                                </div>

                                <div>In the example code, because we are
                                  reducing the number of elements from
                                  8->4 in the vpmaddwd step, we are
                                  left with a width mismatch between
                                  vpmaddwd and the vpaddd instruction
                                  that we use to sum with the results
                                  from the previous loop iterations. We
                                  rely on the fact that a 128-bit
                                  vpmaddwd zeros the upper bits of the
                                  register, which lets us use a 256-bit
                                  vpaddd instruction: the upper
                                  elements keep going around the
                                  loop undisturbed in case
                                  they weren't initialized to 0. But
                                  this still means the vpmaddwd
                                  instruction is doing half the
                                  work the CPU could do with a 256-bit
                                  vpmaddwd instruction. Additionally,
                                  future x86 CPUs will be gaining an
                                  instruction that can do VPMADDWD and
                                  VPADDD in one instruction, but that
                                  width mismatch makes that instruction
                                  difficult to utilize.</div>

                                <div><br>

                                </div>

                                <div>In order for the backend to handle

                                  this better it would be great if we

                                  could have something like two v32i8

                                  loads, two shufflevectors to extract

                                  the even elements and the odd elements

                                  to create four v16i8 pieces.</div>

                              </div>

                            </blockquote>

                            <br>

                            Why v*i8 loads? I thought that we have

                            16-bit and 32-bit types here?<br>

                          </div>

                        </blockquote>

                        <div><br>

                        </div>

                        <div>Oops that should have been v16i16. Mixed up

                          my 256-bit types.</div>

                        <div> </div>

                        <blockquote class="gmail_quote"

                          style="margin:0px 0px 0px

                          0.8ex;border-left:1px solid

                          rgb(204,204,204);padding-left:1ex">

                          <div bgcolor="#FFFFFF"> <br>

                            <blockquote type="cite">

                              <div dir="ltr">

                                <div>Sign extend each of those pieces.

                                  Multiply the two even pieces and the

                                  two odd pieces separately, sum those

                                  results with a v8i32 add. Then another

                                  v8i32 add to accumulate the previous

                                  loop iterations.</div>

                              </div>

                            </blockquote>

                          </div>

                        </blockquote>

                      </div>

                    </div>

                  </blockquote>

                  <br>

                  I'm still missing something. Why do you want to

                  separate out the even and odd parts instead of just

                  adding up the first half of the numbers and the second

                  half?<br>

                </div>

              </blockquote>

              <div><br>

              </div>

              <div>Doing even/odd matches up with a pattern I already

                have to support for the code in <a

                  href="https://reviews.llvm.org/D49636" target="_blank"

                  moz-do-not-send="true">https://reviews.llvm.org/D49636</a>.

                I wouldn't even need to detect it as a reduction to do
                the reassociation since even/odd exactly matches the
                behavior of the instruction. But you're right, we could

                also just detect the reduction and add two halves.</div>
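              <div><br>
              </div>
              <div>A small scalar sketch of the two groupings being
                compared, for one 16-element step. The function names
                are hypothetical, and overflow is ignored. Both
                return the same total once the eight lanes are reduced,
                since integer addition can be reassociated freely; only
                the even/odd version matches vpmaddwd lane by lane.</div>
<pre>
```c
/* Pair adjacent products: lane i holds elements 2i and 2i+1,
   which is the grouping vpmaddwd produces. Names are hypothetical. */
int dot16_even_odd(const short a[16], const short b[16]) {
    int lanes[8];
    for (int i = 0; i != 8; ++i)
        lanes[i] = (int)a[2 * i] * (int)b[2 * i]
                 + (int)a[2 * i + 1] * (int)b[2 * i + 1];
    int sum = 0;
    for (int i = 0; i != 8; ++i)
        sum += lanes[i]; /* final reduction after the loop */
    return sum;
}

/* Pair first half with second half: lane i holds elements i and i+8. */
int dot16_halves(const short a[16], const short b[16]) {
    int lanes[8];
    for (int i = 0; i != 8; ++i)
        lanes[i] = (int)a[i] * (int)b[i]
                 + (int)a[i + 8] * (int)b[i + 8];
    int sum = 0;
    for (int i = 0; i != 8; ++i)
        sum += lanes[i]; /* final reduction after the loop */
    return sum;
}
```
</pre>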

              <div><br>

              </div>

              <div> </div>

              <blockquote class="gmail_quote" style="margin:0px 0px 0px

                0.8ex;border-left:1px solid

                rgb(204,204,204);padding-left:1ex">

                <div bgcolor="#FFFFFF"> <br>

                  Thanks again,<br>

                  Hal<br>

                  <br>

                  <blockquote type="cite">

                    <div dir="ltr">

                      <div class="gmail_quote">

                        <blockquote class="gmail_quote"

                          style="margin:0px 0px 0px

                          0.8ex;border-left:1px solid

                          rgb(204,204,204);padding-left:1ex">

                          <div bgcolor="#FFFFFF">

                            <blockquote type="cite">

                              <div dir="ltr">

                                <div> This ensures that no pieces exceed

                                  the target vector width and the final

                                  operation is correctly sized to go

                                  around the loop in one register. All

                                  but the last add can then be pattern

                                  matched to vpmaddwd as proposed in <a

href="https://reviews.llvm.org/D49636" target="_blank"

                                    moz-do-not-send="true">https://reviews.llvm.org/D49636</a>.

                                  And for the future CPU the whole thing

                                  can be matched to the new instruction.<br>

                                </div>

                                <div><br>

                                </div>

                                <div>Do other targets have a similar

                                  instruction or a similar issue to

                                  this? Is this something we can solve

                                  in the loop vectorizer? Or should we

                                  have a separate IR transformation that

                                  can recognize this pattern and

                                  generate the new sequence? As a

                                  separate pass we would need to pair

                                  two vector loads together, remove a

                                  reduction step outside the loop and

                                  remove half the phis assuming the loop

                                  was partially unrolled. Or if there

                                  was only one add/mul inside the loop

                                  we'd have to reduce its width and the

                                  width of the phi.</div>

                              </div>

                            </blockquote>

                            <br>

                            Can you explain how the desired code from

                            the vectorizer differs from the code that

                            the vectorizer produces if you add '#pragma

                            clang loop vectorize(enable)

                            vectorize_width(16)'  above the loop? I

                            tried it in your godbolt example and the

                            generated code looks very similar to the

                            icc-generated code.<br>
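                            <br>
                            For reference, a sketch of where the pragma
                            goes. The loop below is a hypothetical
                            stand-in for the code behind the godbolt
                            link, which is not reproduced in this
                            thread; the function and parameter names
                            are assumptions.<br>
<pre>
```c
/* Hypothetical stand-in for the dot product loop in the godbolt link. */
int dot_product(const short *a, const short *b, int n) {
    int sum = 0;
    /* Clang loop hint: request vectorization with a vectorization factor of 16.
       Compilers that do not know this pragma ignore it. */
#pragma clang loop vectorize(enable) vectorize_width(16)
    for (int i = 0; n > i; ++i)
        sum += (int)a[i] * (int)b[i]; /* sign-extend to 32 bits, then multiply */
    return sum;
}
```
</pre>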

                          </div>

                        </blockquote>

                        <div><br>

                        </div>

                        <div>It's similar, but the vpxor %xmm0, %xmm0,
                          %xmm0 is being unnecessarily carried across
                          the loop. It's then redundantly added twice in
                          the reduction after the loop despite it being
                          0. This happens because we basically tricked
                          the backend into generating a 256-bit vpmaddwd
                          concatenated with a 256-bit zero vector going into
                          a 512-bit vpaddd before type legalization. The
                          512-bit concat and vpaddd get split during
                          type legalization, and the high half of the
                          add gets constant folded away. I'm guessing we
                          probably finished with 4 vpxors before the
                          loop but MachineCSE (or some other pass?)
                          combined two of them when it figured out the
                          loop didn't modify them.</div>

                        <div> </div>

                        <blockquote class="gmail_quote"

                          style="margin:0px 0px 0px

                          0.8ex;border-left:1px solid

                          rgb(204,204,204);padding-left:1ex">

                          <div bgcolor="#FFFFFF"> <br>

                            Thanks again,<br>

                            Hal<br>

                            <br>

                            <blockquote type="cite">

                              <div dir="ltr">

                                <div><br>

                                </div>

                                Thanks,<br clear="all">

                                <div>

                                  <div dir="ltr"

class="m_-1066914453783556908gmail-m_5656040821941643324m_-4456769814224348104m_7042871795306411262gmail-m_-1264842386137689721gmail_signature">~Craig</div>

                                </div>

                              </div>

                            </blockquote>

                            <br>

                            <pre class="m_-1066914453783556908gmail-m_5656040821941643324m_-4456769814224348104m_7042871795306411262moz-signature" cols="72">-- 

Hal Finkel

Lead, Compiler Technology and Programming Languages

Leadership Computing Facility

Argonne National Laboratory</pre>

                          </div>

                        </blockquote>

                      </div>

                    </div>

                  </blockquote>

                  <br>


                </div>

              </blockquote>

            </div>

          </div>

        </blockquote>

      </div>

    </blockquote>

    <br>


  </body>

</html>