<div dir="ltr">With maximize-bandwidth I'd still end up with the extra vpxors above the loop and the extra addition reduction steps at the end that we get from forcing the vf to 16 right?<div><br clear="all"><div><div dir="ltr" class="gmail_signature" data-smartmail="gmail_signature">~Craig</div></div><br></div></div><br><div class="gmail_quote"><div dir="ltr">On Tue, Jul 24, 2018 at 10:04 AM Craig Topper <<a href="mailto:craig.topper@gmail.com">craig.topper@gmail.com</a>> wrote:<br></div><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"><div dir="ltr"><br><br><div class="gmail_quote"><div dir="ltr">On Tue, Jul 24, 2018 at 6:10 AM Hal Finkel <<a href="mailto:hfinkel@anl.gov" target="_blank">hfinkel@anl.gov</a>> wrote:<br></div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">
<div bgcolor="#FFFFFF">
<br>
<div class="m_-1066914453783556908gmail-m_5656040821941643324moz-cite-prefix">On 07/23/2018 06:37 PM, Craig Topper
wrote:<br>
</div>
<blockquote type="cite">
<div dir="ltr"><br clear="all">
<div>
<div dir="ltr" class="m_-1066914453783556908gmail-m_5656040821941643324m_-4456769814224348104gmail_signature">~Craig</div>
</div>
<br>
<br>
<div class="gmail_quote">
<div dir="ltr">On Mon, Jul 23, 2018 at 4:24 PM Hal Finkel <<a href="mailto:hfinkel@anl.gov" target="_blank">hfinkel@anl.gov</a>> wrote:<br>
</div>
<blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">
<div bgcolor="#FFFFFF"> <br>
<div class="m_-1066914453783556908gmail-m_5656040821941643324m_-4456769814224348104m_7042871795306411262moz-cite-prefix">On
07/23/2018 05:22 PM, Craig Topper wrote:<br>
</div>
<blockquote type="cite">
<div dir="ltr">
<div>Hello all,</div>
<div><br>
</div>
<div>This code <a href="https://godbolt.org/g/tTyxpf" target="_blank">https://godbolt.org/g/tTyxpf</a> is
a dot product reduction loop multipying sign
extended 16-bit values to produce a 32-bit
accumulated result. The x86 backend is currently not
able to optimize it as well as gcc and icc. The IR
we are getting from the loop vectorizer has several
v8i32 adds and muls inside the loop. These are fed
by v8i16 loads and sexts from v8i16 to v8i32. The
x86 backend recognizes that these are addition
reductions of multiplication so we use the vpmaddwd
instruction which calculates 32-bit products from
16-bit inputs and does a horizontal add of adjacent
pairs. A vpmaddwd given two v8i16 inputs will
produce a v4i32 result.</div>
</div>
</blockquote>
</div>
</blockquote>
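A scalar loop of the shape being described looks roughly like this (a sketch; the function and parameter names are assumptions, not taken from the godbolt example):

  #include <stdint.h>

  /* Dot product of sign-extended 16-bit inputs accumulated into 32 bits.
     Names here are illustrative only. */
  int32_t dot(const int16_t *a, const int16_t *b, int n) {
    int32_t sum = 0;
    for (int i = 0; i < n; ++i)
      sum += (int32_t)a[i] * (int32_t)b[i];
    return sum;
  }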
>>> That godbolt link seems wrong. It wasn't supposed to be clang IR. This should be right.
<blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">
<div bgcolor="#FFFFFF">
<blockquote type="cite">
<div dir="ltr">
<div><br>
</div>
<div>In the example code, because we are reducing the
number of elements from 8->4 in the vpmaddwd step
we are left with a width mismatch between vpmaddwd
and the vpaddd instruction that we use to sum with
the results from the previous loop iterations. We
rely on the fact that a 128-bit vpmaddwd zeros the
upper bits of the register so that we can use a
256-bit vpaddd instruction so that the upper
elements can keep going around the loop without
being disturbed in case they weren't initialized to
0. But this still means the vpmaddwd instruction is
doing half the amount of work the CPU is capable of
if we had been able to use a 256-bit vpmaddwd
instruction. Additionally, future x86 CPUs will be
gaining an instruction that can do VPMADDWD and
VPADDD in one instruction, but that width mismatch
makes that instruction difficult to utilize.</div>
<div><br>
</div>
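The full-width form being aimed at here, written out with AVX2 intrinsics as a sketch (the function name, the assumption that n is a multiple of 16, and the reduction tail are illustrative, not taken from the thread):

  #include <immintrin.h>
  #include <stdint.h>

  /* 256-bit vpmaddwd feeding a same-width vpaddd accumulator, so the
     multiply-add runs at full width instead of half. Assumes n % 16 == 0. */
  int32_t dot_avx2(const int16_t *a, const int16_t *b, int n) {
    __m256i acc = _mm256_setzero_si256();
    for (int i = 0; i < n; i += 16) {
      __m256i va = _mm256_loadu_si256((const __m256i *)(a + i));
      __m256i vb = _mm256_loadu_si256((const __m256i *)(b + i));
      /* vpmaddwd: sixteen 16x16-bit products, horizontally added in
         adjacent pairs, giving eight 32-bit sums. */
      __m256i pairs = _mm256_madd_epi16(va, vb);
      /* vpaddd at the same 256-bit width -- no width mismatch to hide. */
      acc = _mm256_add_epi32(acc, pairs);
    }
    /* Horizontal reduction of the eight 32-bit lanes. */
    __m128i lo = _mm256_castsi256_si128(acc);
    __m128i hi = _mm256_extracti128_si256(acc, 1);
    __m128i s = _mm_add_epi32(lo, hi);
    s = _mm_add_epi32(s, _mm_shuffle_epi32(s, _MM_SHUFFLE(1, 0, 3, 2)));
    s = _mm_add_epi32(s, _mm_shuffle_epi32(s, _MM_SHUFFLE(2, 3, 0, 1)));
    return _mm_cvtsi128_si32(s);
  }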
>>>>> In order for the backend to handle this better it would be great if we could have something like two v32i8 loads, two shufflevectors to extract the even elements and the odd elements to create four v16i8 pieces.

>>>> Why v*i8 loads? I thought that we have 16-bit and 32-bit types here?
>>> Oops, that should have been v16i16. I mixed up my 256-bit types.
<blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">
<div bgcolor="#FFFFFF"> <br>
<blockquote type="cite">
<div dir="ltr">
<div>Sign extend each of those pieces. Multiply the
two even pieces and the two odd pieces separately,
sum those results with a v8i32 add. Then another
v8i32 add to accumulate the previous loop
iterations.</div>
</div>
</blockquote>
</div>
</blockquote>
</div>
</div>
</blockquote>
>> I'm still missing something. Why do you want to separate out the even and odd parts instead of just adding up the first half of the numbers and the second half?

> Doing even/odd matches up with a pattern I already have to support for the code in https://reviews.llvm.org/D49636. I wouldn't even need to detect it as a reduction to do the reassociation, since even/odd exactly matches the behavior of the instruction. But you're right, we could also just detect the reduction and add the two halves.
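In scalar terms, the even/odd grouping amounts to the reassociation below (a sketch; the function name and the assumption that n is even are illustrative). Each pair sum is exactly what one vpmaddwd lane computes:

  #include <stdint.h>

  /* Same dot product, reassociated so each 32-bit add combines one
     even-indexed and one odd-indexed product -- the adjacent-pair sum
     that vpmaddwd produces per lane. Assumes n is even. */
  int32_t dot_even_odd(const int16_t *a, const int16_t *b, int n) {
    int32_t sum = 0;
    for (int i = 0; i < n; i += 2) {
      int32_t even = (int32_t)a[i] * (int32_t)b[i];
      int32_t odd = (int32_t)a[i + 1] * (int32_t)b[i + 1];
      sum += even + odd;
    }
    return sum;
  }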
>> Thanks again,
>> Hal
>>>>> This ensures that no piece exceeds the target vector width and that the final operation is correctly sized to go around the loop in one register. All but the last add can then be pattern matched to vpmaddwd as proposed in https://reviews.llvm.org/D49636, and for the future CPU the whole thing can be matched to the new instruction.
>>>>>
>>>>> Do other targets have a similar instruction or a similar issue to this? Is this something we can solve in the loop vectorizer, or should we have a separate IR transformation that can recognize this pattern and generate the new sequence? As a separate pass we would need to pair two vector loads together, remove a reduction step outside the loop, and remove half the phis, assuming the loop was partially unrolled. Or, if there was only one add/mul inside the loop, we'd have to reduce its width and the width of the phi.
>>>> Can you explain how the desired code from the vectorizer differs from the code that the vectorizer produces if you add '#pragma clang loop vectorize(enable) vectorize_width(16)' above the loop? I tried it in your godbolt example and the generated code looks very similar to the icc-generated code.
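For concreteness, that suggestion applied to the loop sketched earlier (the pragma text is taken from the message above; the surrounding function is illustrative):

  #include <stdint.h>

  /* Force the vectorizer to a width of 16 on the dot product loop. */
  int32_t dot_vf16(const int16_t *a, const int16_t *b, int n) {
    int32_t sum = 0;
  #pragma clang loop vectorize(enable) vectorize_width(16)
    for (int i = 0; i < n; ++i)
      sum += (int32_t)a[i] * (int32_t)b[i];
    return sum;
  }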
>>> It's similar, but the vpxor %xmm0, %xmm0, %xmm0 is being unnecessarily carried across the loop, and it's then redundantly added twice in the reduction after the loop despite being 0. This happens because we basically tricked the backend into generating a 256-bit vpmaddwd concatenated with a 256-bit zero vector going into a 512-bit vpaddd before type legalization. The 512-bit concat and vpaddd get split during type legalization, and the high half of the add gets constant folded away. I'm guessing we probably finished with 4 vpxors before the loop, but MachineCSE (or some other pass?) combined two of them when it figured out the loop didn't modify them.
<blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">
<div bgcolor="#FFFFFF"> <br>
Thanks again,<br>
Hal<br>
<br>
<blockquote type="cite">
<div dir="ltr">
<div><br>
</div>
Thanks,<br clear="all">
<div>
<div dir="ltr" class="m_-1066914453783556908gmail-m_5656040821941643324m_-4456769814224348104m_7042871795306411262gmail-m_-1264842386137689721gmail_signature">~Craig</div>
</div>
</div>
</blockquote>
>>>> --
>>>> Hal Finkel
>>>> Lead, Compiler Technology and Programming Languages
>>>> Leadership Computing Facility
>>>> Argonne National Laboratory
>> --
>> Hal Finkel
>> Lead, Compiler Technology and Programming Languages
>> Leadership Computing Facility
>> Argonne National Laboratory