<html>

  <head>

    <meta http-equiv="Content-Type" content="text/html; charset=utf-8">

  </head>

  <body text="#000000" bgcolor="#FFFFFF">

    <br>

    <div class="moz-cite-prefix">On 07/23/2018 06:37 PM, Craig Topper

      wrote:<br>

    </div>

    <blockquote type="cite"

cite="mid:CAF7ks-NkfxRcBoN5VoPFrJhn36TJPptFYy4Y6o6myt+Awe9A0g@mail.gmail.com">

      <meta http-equiv="Content-Type" content="text/html; charset=utf-8">

      <div dir="ltr"><br clear="all">

        <div>

          <div dir="ltr" class="m_-4456769814224348104gmail_signature"

            data-smartmail="gmail_signature">~Craig</div>

        </div>

        <br>

        <br>

        <div class="gmail_quote">

          <div dir="ltr">On Mon, Jul 23, 2018 at 4:24 PM Hal Finkel <<a

              href="mailto:hfinkel@anl.gov" target="_blank"

              moz-do-not-send="true">hfinkel@anl.gov</a>> wrote:<br>

          </div>

          <blockquote class="gmail_quote" style="margin:0 0 0

            .8ex;border-left:1px #ccc solid;padding-left:1ex">

            <div text="#000000" bgcolor="#FFFFFF"> <br>

              <div

                class="m_-4456769814224348104m_7042871795306411262moz-cite-prefix">On

                07/23/2018 05:22 PM, Craig Topper wrote:<br>

              </div>

              <blockquote type="cite">

                <div dir="ltr">

                  <div>Hello all,</div>

                  <div><br>

                  </div>

                  <div>This code <a href="https://godbolt.org/g/tTyxpf"

                      target="_blank" moz-do-not-send="true">https://godbolt.org/g/tTyxpf</a> is

                    a dot product reduction loop multipying sign

                    extended 16-bit values to produce a 32-bit

                    accumulated result. The x86 backend is currently not

                    able to optimize it as well as gcc and icc. The IR

                    we are getting from the loop vectorizer has several

                    v8i32 adds and muls inside the loop. These are fed

                    by v8i16 loads and sexts from v8i16 to v8i32. The

                    x86 backend recognizes that these are addition

                    reductions of multiplication so we use the vpmaddwd

                    instruction which calculates 32-bit products from

                    16-bit inputs and does a horizontal add of adjacent

                    pairs. A vpmaddwd given two v8i16 inputs will

                    produce a v4i32 result.</div>

                </div>

              </blockquote>

            </div>

          </blockquote>

          <div><br>

          </div>

          <div>That godbolt link seems wrong. It wasn't supposed to be

            clang IR. This should be right.</div>

          <div> </div>

          <blockquote class="gmail_quote" style="margin:0 0 0

            .8ex;border-left:1px #ccc solid;padding-left:1ex">

            <div text="#000000" bgcolor="#FFFFFF">

              <blockquote type="cite">

                <div dir="ltr">

                  <div><br>

                  </div>

                  <div>In the example code, because we are reducing the

                    number of elements from 8->4 in the vpmaddwd step

                    we are left with a width mismatch between vpmaddwd

                    and the vpaddd instruction that we use to sum with

                    the results from the previous loop iterations. We

                    rely on the fact that a 128-bit vpmaddwd zeros the

                    upper bits of the register so that we can use a

                    256-bit vpaddd instruction so that the upper

                    elements can keep going around the loop without

                    being disturbed in case they weren't initialized to

                    0. But this still means the vpmaddwd instruction is

                    doing half the amount of work the CPU is capable of

                    if we had been able to use a 256-bit vpmaddwd

                    instruction. Additionally, future x86 CPUs will be

                    gaining an instruction that can do VPMADDWD and

                    VPADDD in one instruction, but that width mismatch

                    makes that instruction difficult to utilize.</div>

                  <div><br>

                  </div>

                  <div>In order for the backend to handle this better it

                    would be great if we could have something like two

                    v32i8 loads, two shufflevectors to extract the even

                    elements and the odd elements to create four v16i8

                    pieces.</div>

                </div>

              </blockquote>

              <br>

              Why v*i8 loads? I thought that we have 16-bit and 32-bit

              types here?<br>

            </div>

          </blockquote>

          <div><br>

          </div>

          <div>Oops that should have been v16i16. Mixed up my 256-bit

            types.</div>

          <div> </div>

          <blockquote class="gmail_quote" style="margin:0 0 0

            .8ex;border-left:1px #ccc solid;padding-left:1ex">

            <div text="#000000" bgcolor="#FFFFFF"> <br>

              <blockquote type="cite">

                <div dir="ltr">

                  <div>Sign extend each of those pieces. Multiply the

                    two even pieces and the two odd pieces separately,

                    sum those results with a v8i32 add. Then another

                    v8i32 add to accumulate the previous loop

                    iterations.</div>

                </div>

              </blockquote>

            </div>

          </blockquote>

        </div>

      </div>

    </blockquote>

    <br>

    I'm still missing something. Why do you want to separate out the

    even and odd parts instead of just adding up the first half of the

    numbers and the second half?<br>

    <br>

    Thanks again,<br>

    Hal<br>

    <br>

    <blockquote type="cite"

cite="mid:CAF7ks-NkfxRcBoN5VoPFrJhn36TJPptFYy4Y6o6myt+Awe9A0g@mail.gmail.com">

      <div dir="ltr">

        <div class="gmail_quote">

          <blockquote class="gmail_quote" style="margin:0 0 0

            .8ex;border-left:1px #ccc solid;padding-left:1ex">

            <div text="#000000" bgcolor="#FFFFFF">

              <blockquote type="cite">

                <div dir="ltr">

                  <div> Then ensures that no pieces exceed the target

                    vector width and the final operation is correctly

                    sized to go around the loop in one register. All but

                    the last add can then be pattern matched to vpmaddwd

                    as proposed in <a

                      href="https://reviews.llvm.org/D49636"

                      target="_blank" moz-do-not-send="true">https://reviews.llvm.org/D49636</a>.

                    And for the future CPU the whole thing can be

                    matched to the new instruction.<br>

                  </div>

                  <div><br>

                  </div>

                  <div>Do other targets have a similar instruction or a

                    similar issue to this? Is this something we can

                    solve in the loop vectorizer? Or should we have a

                    separate IR transformation that can recognize this

                    pattern and generate the new sequence? As a separate

                    pass we would need to pair two vector loads

                    together, remove a reduction step outside the loop

                    and remove half the phis assuming the loop was

                    partially unrolled. Or if there was only one add/mul

                    inside the loop we'd have to reduce its width and

                    the width of the phi.</div>

                </div>

              </blockquote>

              <br>

              Can you explain how the desired code from the vectorizer

              differs from the code that the vectorizer produces if you

              add '#pragma clang loop vectorize(enable)

              vectorize_width(16)'  above the loop? I tried it in your

              godbolt example and the generated code looks very similar

              to the icc-generated code.<br>

            </div>

          </blockquote>

          <div><br>

          </div>

          <div>It's similar, but the vpxor %xmm0, %xmm0, %xmm0 is being

            unnecessarily carried across the loop. It's then redundantly

            added twice in the reduction after the loop despite it being

            0. This happens because we basically tricked the backend

            into generating a 256-bit vpmaddwd concated with a 256-bit

            zero vector going into a 512-bit vaddd before type

            legalization. The 512-bit concat and vpaddd get split during

            type legalization, and the high half of the add gets

            constant folded away. I'm guessing we probably finished with

            4 vpxors before the loop but MachineCSE(or some other pass?)

            combined two of them when it figured out the loop didn't

            modify them.</div>

          <div> </div>

          <blockquote class="gmail_quote" style="margin:0 0 0

            .8ex;border-left:1px #ccc solid;padding-left:1ex">

            <div text="#000000" bgcolor="#FFFFFF"> <br>

              Thanks again,<br>

              Hal<br>

              <br>

              <blockquote type="cite">

                <div dir="ltr">

                  <div><br>

                  </div>

                  Thanks,<br clear="all">

                  <div>

                    <div dir="ltr"

class="m_-4456769814224348104m_7042871795306411262gmail-m_-1264842386137689721gmail_signature">~Craig</div>

                  </div>

                </div>

              </blockquote>

              <br>

              <pre class="m_-4456769814224348104m_7042871795306411262moz-signature" cols="72">-- 

Hal Finkel

Lead, Compiler Technology and Programming Languages

Leadership Computing Facility

Argonne National Laboratory</pre>

            </div>

          </blockquote>

        </div>

      </div>

    </blockquote>

    <br>

    <pre class="moz-signature" cols="72">-- 

Hal Finkel

Lead, Compiler Technology and Programming Languages

Leadership Computing Facility

Argonne National Laboratory</pre>

  </body>

</html>