<html>
<head>
<meta http-equiv="Content-Type" content="text/html; charset=utf-8">
</head>
<body text="#000000" bgcolor="#FFFFFF">
<div class="moz-cite-prefix">On 07/23/2018 06:23 PM, Hal Finkel via
llvm-dev wrote:<br>
</div>
<blockquote type="cite"
cite="mid:18f3ab75-600e-8b64-03b0-e7ac9e50c9cb@anl.gov">
<br>
<div class="moz-cite-prefix">On 07/23/2018 05:22 PM, Craig Topper
wrote:<br>
</div>
<blockquote type="cite"
cite="mid:CAF7ks-PmFE2PswYKW_0jDGLyZFCaOvD70sFAbvoJZ__X1aTJxw@mail.gmail.com">
<div dir="ltr">
<div>Hello all,</div>
<div><br>
</div>
<div>This code <a href="https://godbolt.org/g/tTyxpf"
target="_blank" moz-do-not-send="true">https://godbolt.org/g/tTyxpf</a> is
a dot product reduction loop multiplying sign-extended 16-bit
values to produce a 32-bit accumulated result. The x86
backend is currently not able to optimize it as well as gcc
and icc. The IR we are getting from the loop vectorizer has
several v8i32 adds and muls inside the loop. These are fed
by v8i16 loads and sexts from v8i16 to v8i32. The x86
backend recognizes that these are add reductions of
multiplications, so we use the vpmaddwd instruction, which
calculates 32-bit products from 16-bit inputs and does a
horizontal add of adjacent pairs. A vpmaddwd given two v8i16
inputs will produce a v4i32 result.</div>
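<div><br>
</div>
<div>The vpmaddwd behavior described above can be sketched in scalar
C. This is only an illustrative model, not backend code: the
function name pmaddwd_v8i16 is made up, and it assumes short is
16 bits and int is 32 bits. The one input combination that
overflows a 32-bit sum (all four values in a pair equal to -32768)
is not handled.</div>
<pre>
```c
/* Scalar model of a 128-bit vpmaddwd (hypothetical helper name).
   Inputs: two vectors of 8 signed 16-bit values.
   Output: 4 signed 32-bit values, one per adjacent input pair. */
void pmaddwd_v8i16(const short *a, const short *b, int *out) {
    for (int i = 0; i != 4; ++i) {
        /* Sign-extend each 16-bit element to 32 bits and multiply. */
        int even = (int)a[2 * i] * (int)b[2 * i];
        int odd = (int)a[2 * i + 1] * (int)b[2 * i + 1];
        /* Horizontal add of the adjacent pair of products. */
        out[i] = even + odd;
    }
}
```
</pre>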
<div><br>
</div>
<div>In the example code, because we are reducing the number
of elements from 8->4 in the vpmaddwd step we are left
with a width mismatch between vpmaddwd and the vpaddd
instruction that we use to sum with the results from the
previous loop iterations. We rely on the fact that a 128-bit
vpmaddwd zeros the upper bits of the register, which lets us
use a 256-bit vpaddd instruction: the upper elements can keep
going around the loop without being disturbed, in case they
weren't initialized to 0. But this still means the vpmaddwd
instruction is doing half the work the CPU could do if we had
been able to use a 256-bit vpmaddwd instruction. Additionally,
future x86 CPUs will be
gaining an instruction that can do VPMADDWD and VPADDD in
one instruction, but that width mismatch makes that
instruction difficult to utilize.</div>
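<div><br>
</div>
<div>As a rough scalar model of the width mismatch described above
(the function name accumulate_mismatched is hypothetical, and short
and int are assumed to be 16 and 32 bits): the 4-lane vpmaddwd
result is treated as an 8-lane value whose upper lanes are zero, so
the 8-lane add leaves the upper half of the accumulator
unchanged.</div>
<pre>
```c
/* One loop iteration with mismatched widths (illustrative only).
   acc8 models the 256-bit accumulator, a and b are 8 x i16 inputs. */
void accumulate_mismatched(int *acc8, const short *a, const short *b) {
    /* Upper 4 lanes stay zero, modelling the zeroed upper register bits. */
    int widened[8] = {0};
    for (int i = 0; i != 4; ++i) {
        /* 128-bit vpmaddwd: only 4 result lanes carry useful work. */
        widened[i] = (int)a[2 * i] * (int)b[2 * i]
                   + (int)a[2 * i + 1] * (int)b[2 * i + 1];
    }
    for (int i = 0; i != 8; ++i) {
        /* 256-bit vpaddd: lanes 4..7 just add zero. */
        acc8[i] += widened[i];
    }
}
```
</pre>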
<div><br>
</div>
<div>In order for the backend to handle this better it would
be great if we could have something like two v32i8 loads,
two shufflevectors to extract the even elements and the odd
elements to create four v16i8 pieces.</div>
</div>
</blockquote>
<br>
Why v*i8 loads? I thought that we have 16-bit and 32-bit types
here?<br>
<br>
<blockquote type="cite"
cite="mid:CAF7ks-PmFE2PswYKW_0jDGLyZFCaOvD70sFAbvoJZ__X1aTJxw@mail.gmail.com">
<div dir="ltr">
<div>Sign extend each of those pieces. Multiply the two even
pieces and the two odd pieces separately, then sum those results
with a v8i32 add. Then use another v8i32 add to accumulate with
the previous loop iterations. This ensures that no pieces exceed
the target vector width and that the final operation is correctly
sized to go around the loop in one register. All but the
last add can then be pattern matched to vpmaddwd as proposed
in <a href="https://reviews.llvm.org/D49636"
moz-do-not-send="true">https://reviews.llvm.org/D49636</a>.
And for the future CPU the whole thing can be matched to the
new instruction.<br>
</div>
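<div><br>
</div>
<div>A scalar sketch of the sequence described above, written under
one interpretation only: it assumes 16-bit elements (16 per load,
split into even and odd halves of 8) and a hypothetical function
name dot_step_even_odd. It is meant to show that the even/odd
multiplies plus the first add give the same per-lane value as a
pairwise multiply-add, leaving a single 8-lane add to carry the
accumulator around the loop.</div>
<pre>
```c
/* One iteration of the proposed shape (illustrative only).
   a16 and b16 model two loads of 16 signed 16-bit elements,
   acc8 models the 8 x i32 accumulator carried by the loop phi. */
void dot_step_even_odd(const short *a16, const short *b16, int *acc8) {
    for (int i = 0; i != 8; ++i) {
        /* "Shuffles": pick even and odd elements, then sign-extend. */
        int a_even = (int)a16[2 * i];
        int a_odd = (int)a16[2 * i + 1];
        int b_even = (int)b16[2 * i];
        int b_odd = (int)b16[2 * i + 1];
        /* Two multiplies and the first add: the part that would be
           pattern matched to a full-width vpmaddwd. */
        int pair_sum = a_even * b_even + a_odd * b_odd;
        /* Second add: accumulate with previous iterations. */
        acc8[i] += pair_sum;
    }
}
```
</pre>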
<div><br>
</div>
<div>Do other targets have a similar instruction or a similar
issue to this? Is this something we can solve in the loop
vectorizer? Or should we have a separate IR transformation
that can recognize this pattern and generate the new
sequence? As a separate pass we would need to pair two
vector loads together, remove a reduction step outside the
loop and remove half the phis assuming the loop was
partially unrolled. Or if there was only one add/mul inside
the loop we'd have to reduce its width and the width of the
phi.</div>
</div>
</blockquote>
<br>
Can you explain how the desired code from the vectorizer differs
from the code that the vectorizer produces if you add '#pragma
clang loop vectorize(enable) vectorize_width(16)' above the loop?
I tried it in your godbolt example and the generated code looks
very similar to the icc-generated code.<br>
</blockquote>
<br>
(specifically, I mean this: <a class="moz-txt-link-freetext" href="https://godbolt.org/g/LJA38e">https://godbolt.org/g/LJA38e</a>)<br>
<br>
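For reference, the pragma placement looks roughly like the sketch
below. The loop itself is a guess at the shape of the code behind
the godbolt links (the function name dot_product and its signature
are assumptions, and n is assumed to be non-negative); only the
pragma line is taken from this thread.<br>
<pre>
```c
/* Hypothetical reconstruction of the dot product loop; the actual
   source is behind the godbolt link and may differ. */
int dot_product(const short *a, const short *b, int n) {
    int sum = 0;
    /* Ask clang's loop vectorizer for a vectorization factor of 16. */
#pragma clang loop vectorize(enable) vectorize_width(16)
    for (int i = 0; i != n; ++i) {
        /* Sign-extend the 16-bit inputs, multiply, accumulate in 32 bits. */
        sum += (int)a[i] * (int)b[i];
    }
    return sum;
}
```
</pre>
<br>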
<blockquote type="cite"
cite="mid:18f3ab75-600e-8b64-03b0-e7ac9e50c9cb@anl.gov"> <br>
Thanks again,<br>
Hal<br>
<br>
<blockquote type="cite"
cite="mid:CAF7ks-PmFE2PswYKW_0jDGLyZFCaOvD70sFAbvoJZ__X1aTJxw@mail.gmail.com">
<div dir="ltr">
<div><br>
</div>
Thanks,<br clear="all">
<div>
<div dir="ltr"
class="gmail-m_-1264842386137689721gmail_signature">~Craig</div>
</div>
</div>
</blockquote>
<br>
<pre class="moz-signature" cols="72">--
Hal Finkel
Lead, Compiler Technology and Programming Languages
Leadership Computing Facility
Argonne National Laboratory</pre>
</blockquote>
<br>
<pre class="moz-signature" cols="72">--
Hal Finkel
Lead, Compiler Technology and Programming Languages
Leadership Computing Facility
Argonne National Laboratory</pre>
</body>
</html>