<html>

  <head>

    <meta http-equiv="Content-Type" content="text/html; charset=utf-8">

  </head>

  <body text="#000000" bgcolor="#FFFFFF">


    <div class="moz-cite-prefix">On 07/23/2018 06:23 PM, Hal Finkel via

      llvm-dev wrote:<br>

    </div>

    <blockquote type="cite"

      cite="mid:18f3ab75-600e-8b64-03b0-e7ac9e50c9cb@anl.gov">


      <br>

      <div class="moz-cite-prefix">On 07/23/2018 05:22 PM, Craig Topper

        wrote:<br>

      </div>

      <blockquote type="cite"

cite="mid:CAF7ks-PmFE2PswYKW_0jDGLyZFCaOvD70sFAbvoJZ__X1aTJxw@mail.gmail.com">

        <div dir="ltr">

          <div>Hello all,</div>

          <div><br>

          </div>

          <div>This code <a href="https://godbolt.org/g/tTyxpf"

              target="_blank" moz-do-not-send="true">https://godbolt.org/g/tTyxpf</a> is

            a dot product reduction loop multiplying sign-extended 16-bit

            values to produce a 32-bit accumulated result. The x86

            backend is currently not able to optimize it as well as gcc

            and icc. The IR we are getting from the loop vectorizer has

            several v8i32 adds and muls inside the loop. These are fed

            by v8i16 loads and sexts from v8i16 to v8i32. The x86

            backend recognizes that these are addition reductions of

            multiplication so we use the vpmaddwd instruction which

            calculates 32-bit products from 16-bit inputs and does a

            horizontal add of adjacent pairs. A vpmaddwd given two v8i16

            inputs will produce a v4i32 result.</div>
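          <div><br>
          </div>
          <div>As a rough illustration of the description above, here is a scalar C sketch. The godbolt source is not reproduced in this message, so the loop shape and the names dot_product, wrap_add and pmaddwd_model are assumptions made for illustration only. The second function models what a 128-bit vpmaddwd computes on eight 16-bit lanes, using wrapping 32-bit adds.</div>

```c
#include <stddef.h>
#include <stdint.h>

/* Assumed shape of the reduction loop (hypothetical; the real source
   lives behind the godbolt link): 16-bit inputs, 32-bit accumulator. */
int32_t dot_product(const int16_t *a, const int16_t *b, size_t n) {
    int32_t sum = 0;
    for (size_t i = 0; i < n; ++i)
        sum += (int32_t)a[i] * (int32_t)b[i];  /* sext, mul, add */
    return sum;
}

/* 32-bit add that wraps instead of invoking signed-overflow UB. */
static int32_t wrap_add(int32_t x, int32_t y) {
    return (int32_t)((uint32_t)x + (uint32_t)y);
}

/* Scalar model of a 128-bit vpmaddwd: eight i16 lanes per input,
   four i32 lanes out. Each output lane is the sum of the products of
   one adjacent even/odd pair of input lanes. */
void pmaddwd_model(const int16_t a[8], const int16_t b[8], int32_t out[4]) {
    for (int i = 0; i < 4; ++i) {
        int32_t even = (int32_t)a[2 * i] * (int32_t)b[2 * i];
        int32_t odd = (int32_t)a[2 * i + 1] * (int32_t)b[2 * i + 1];
        out[i] = wrap_add(even, odd);  /* horizontal add of the pair */
    }
}
```

          <div>Summing the four output lanes of pmaddwd_model gives the same value as dot_product over the same eight elements, which is why the instruction fits this reduction even though it halves the element count.</div>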

          <div><br>

          </div>

          <div>In the example code, because we are reducing the number

            of elements from 8->4 in the vpmaddwd step we are left

            with a width mismatch between vpmaddwd and the vpaddd

            instruction that we use to sum with the results from the

            previous loop iterations. We rely on the fact that a 128-bit
            vpmaddwd zeros the upper bits of the register, which lets us
            use a 256-bit vpaddd: the upper elements keep going around
            the loop undisturbed in case they weren't initialized to 0.
            But this still means the vpmaddwd instruction does only half
            the work the CPU could do with a 256-bit vpmaddwd
            instruction. Additionally, future x86 CPUs will be

            gaining an instruction that can do VPMADDWD and VPADDD in

            one instruction, but that width mismatch makes that

            instruction difficult to utilize.</div>

          <div><br>

          </div>

          <div>In order for the backend to handle this better it would

            be great if we could have something like two v32i8 loads,

            two shufflevectors to extract the even elements and the odd

            elements to create four v16i8 pieces.</div>

        </div>

      </blockquote>

      <br>

      Why v*i8 loads? I thought that we have 16-bit and 32-bit types

      here?<br>

      <br>

      <blockquote type="cite"

cite="mid:CAF7ks-PmFE2PswYKW_0jDGLyZFCaOvD70sFAbvoJZ__X1aTJxw@mail.gmail.com">

        <div dir="ltr">

          <div>Sign extend each of those pieces. Multiply the two even

            pieces and the two odd pieces separately, sum those results

            with a v8i32 add. Then another v8i32 add to accumulate the

            previous loop iterations. This ensures that no pieces exceed

            the target vector width and the final operation is correctly

            sized to go around the loop in one register. All but the

            last add can then be pattern matched to vpmaddwd as proposed

            in <a href="https://reviews.llvm.org/D49636"

              moz-do-not-send="true">https://reviews.llvm.org/D49636</a>.

            And for the future CPU the whole thing can be matched to the

            new instruction.<br>

          </div>
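          <div><br>
          </div>
          <div>The proposed sequence can be sketched in scalar C as one loop iteration over a full-width vector. This is only a model under stated assumptions: it uses sixteen 16-bit input lanes and eight 32-bit accumulator lanes (the element types discussed for the x86 example), and the names LANES, wrap_add and madd_accumulate are hypothetical.</div>

```c
#include <stdint.h>

enum { LANES = 8 };  /* assumed: eight i32 lanes, i.e. a 256-bit accumulator */

/* 32-bit add that wraps instead of invoking signed-overflow UB. */
static int32_t wrap_add(int32_t x, int32_t y) {
    return (int32_t)((uint32_t)x + (uint32_t)y);
}

/* One loop iteration of the proposed sequence:
     1. split each input into even and odd elements (the shufflevectors),
     2. sign extend each piece to 32 bits,
     3. multiply the even pieces and the odd pieces separately,
     4. add the two products (first v8i32 add),
     5. add the result into the accumulator (second v8i32 add).
   Steps 1-4 together are the part that maps to a 256-bit vpmaddwd. */
void madd_accumulate(const int16_t a[2 * LANES], const int16_t b[2 * LANES],
                     int32_t acc[LANES]) {
    for (int i = 0; i < LANES; ++i) {
        int32_t even = (int32_t)a[2 * i] * (int32_t)b[2 * i];
        int32_t odd = (int32_t)a[2 * i + 1] * (int32_t)b[2 * i + 1];
        int32_t pair = wrap_add(even, odd);
        acc[i] = wrap_add(acc[i], pair);
    }
}
```

          <div>Every intermediate value here is at most eight 32-bit lanes wide, and the accumulator has the same width on entry and exit, so it can go around the loop in one register.</div>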

          <div><br>

          </div>

          <div>Do other targets have a similar instruction or a similar

            issue to this? Is this something we can solve in the loop

            vectorizer? Or should we have a separate IR transformation

            that can recognize this pattern and generate the new

            sequence? As a separate pass we would need to pair two

            vector loads together, remove a reduction step outside the

            loop and remove half the phis assuming the loop was

            partially unrolled. Or if there was only one add/mul inside

            the loop we'd have to reduce its width and the width of the

            phi.</div>

        </div>

      </blockquote>

      <br>

      Can you explain how the desired code from the vectorizer differs

      from the code that the vectorizer produces if you add '#pragma

      clang loop vectorize(enable) vectorize_width(16)'  above the loop?

      I tried it in your godbolt example and the generated code looks

      very similar to the icc-generated code.<br>
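      <br>
      For reference, a minimal sketch of where the pragma goes. The loop body is an assumption, since the godbolt source is not reproduced in this message, and the function name dot_product_vw16 is hypothetical. Compilers other than clang will typically ignore the pragma, possibly with a warning.<br>

```c
#include <stddef.h>
#include <stdint.h>

/* Assumed dot product loop with the vectorization hint applied.
   The pragma must directly precede the loop it applies to. */
int32_t dot_product_vw16(const int16_t *a, const int16_t *b, size_t n) {
    int32_t sum = 0;
#pragma clang loop vectorize(enable) vectorize_width(16)
    for (size_t i = 0; i < n; ++i)
        sum += (int32_t)a[i] * (int32_t)b[i];
    return sum;
}
```

      The pragma only changes how the loop is vectorized; the scalar result is the same with or without it.<br>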

    </blockquote>

    <br>

    (specifically, I mean this: <a class="moz-txt-link-freetext" href="https://godbolt.org/g/LJA38e">https://godbolt.org/g/LJA38e</a>)<br>

    <br>

    <blockquote type="cite"

      cite="mid:18f3ab75-600e-8b64-03b0-e7ac9e50c9cb@anl.gov"> <br>

      Thanks again,<br>

      Hal<br>

      <br>

      <blockquote type="cite"

cite="mid:CAF7ks-PmFE2PswYKW_0jDGLyZFCaOvD70sFAbvoJZ__X1aTJxw@mail.gmail.com">

        <div dir="ltr">

          <div><br>

          </div>

          Thanks,<br clear="all">

          <div>

            <div dir="ltr"

              class="gmail-m_-1264842386137689721gmail_signature">~Craig</div>

          </div>

        </div>

      </blockquote>

      <br>

      <pre class="moz-signature" cols="72">-- 

Hal Finkel

Lead, Compiler Technology and Programming Languages

Leadership Computing Facility

Argonne National Laboratory</pre>


      <pre wrap="">_______________________________________________

LLVM Developers mailing list

<a class="moz-txt-link-abbreviated" href="mailto:llvm-dev@lists.llvm.org">llvm-dev@lists.llvm.org</a>

<a class="moz-txt-link-freetext" href="http://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev">http://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev</a>

</pre>

    </blockquote>

    <br>

    <pre class="moz-signature" cols="72">-- 

Hal Finkel

Lead, Compiler Technology and Programming Languages

Leadership Computing Facility

Argonne National Laboratory</pre>

  </body>

</html>