<html>
<head>
<meta http-equiv="Content-Type" content="text/html; charset=utf-8">
</head>
<body text="#000000" bgcolor="#FFFFFF">
<p><br>
</p>
<br>
<div class="moz-cite-prefix">On 07/24/2018 12:07 PM, Craig Topper
wrote:<br>
</div>
<blockquote type="cite"
cite="mid:CAF7ks-P7fzhig6N8UWzFPyMDn8X5wRfvFLaUPjNxXrFjgVEweQ@mail.gmail.com">
<meta http-equiv="Content-Type" content="text/html; charset=utf-8">
<div dir="ltr">With maximize-bandwidth I'd still end up with the
extra vpxors above the loop and the extra addition reduction
steps at the end that we get from forcing the vf to 16 right?</div>
</blockquote>
<br>
Yea, I think we'd need something else to help with that. My
underlying thought here is that, based on your description, we
really do want a VF of 16 (because we want to use 256-bit loads,
etc.). And so, when thinking about how to fix things, we should
start with looking at the VF = 16 output, and not the VF = 8 output,
as the right starting point for the backend.<br>
<br>
-Hal<br>
<br>
<blockquote type="cite"
cite="mid:CAF7ks-P7fzhig6N8UWzFPyMDn8X5wRfvFLaUPjNxXrFjgVEweQ@mail.gmail.com">
<div dir="ltr">
<div><br clear="all">
<div>
<div dir="ltr" class="gmail_signature"
data-smartmail="gmail_signature">~Craig</div>
</div>
<br>
</div>
</div>
<br>
<div class="gmail_quote">
<div dir="ltr">On Tue, Jul 24, 2018 at 10:04 AM Craig Topper
<<a href="mailto:craig.topper@gmail.com"
moz-do-not-send="true">craig.topper@gmail.com</a>> wrote:<br>
</div>
<blockquote class="gmail_quote" style="margin:0 0 0
.8ex;border-left:1px #ccc solid;padding-left:1ex">
<div dir="ltr"><br>
<br>
<div class="gmail_quote">
<div dir="ltr">On Tue, Jul 24, 2018 at 6:10 AM Hal Finkel
<<a href="mailto:hfinkel@anl.gov" target="_blank"
moz-do-not-send="true">hfinkel@anl.gov</a>> wrote:<br>
</div>
<blockquote class="gmail_quote" style="margin:0px 0px 0px
0.8ex;border-left:1px solid
rgb(204,204,204);padding-left:1ex">
<div bgcolor="#FFFFFF"> <br>
<div
class="m_-1066914453783556908gmail-m_5656040821941643324moz-cite-prefix">On
07/23/2018 06:37 PM, Craig Topper wrote:<br>
</div>
<blockquote type="cite">
<div dir="ltr"><br clear="all">
<div>
<div dir="ltr"
class="m_-1066914453783556908gmail-m_5656040821941643324m_-4456769814224348104gmail_signature">~Craig</div>
</div>
<br>
<br>
<div class="gmail_quote">
<div dir="ltr">On Mon, Jul 23, 2018 at 4:24 PM
Hal Finkel <<a
href="mailto:hfinkel@anl.gov"
target="_blank" moz-do-not-send="true">hfinkel@anl.gov</a>>
wrote:<br>
</div>
<blockquote class="gmail_quote"
style="margin:0px 0px 0px
0.8ex;border-left:1px solid
rgb(204,204,204);padding-left:1ex">
<div bgcolor="#FFFFFF"> <br>
<div
class="m_-1066914453783556908gmail-m_5656040821941643324m_-4456769814224348104m_7042871795306411262moz-cite-prefix">On
07/23/2018 05:22 PM, Craig Topper wrote:<br>
</div>
<blockquote type="cite">
<div dir="ltr">
<div>Hello all,</div>
<div><br>
</div>
<div>This code <a
href="https://godbolt.org/g/tTyxpf"
target="_blank"
moz-do-not-send="true">https://godbolt.org/g/tTyxpf</a> is
a dot product reduction loop
multipying sign extended 16-bit values
to produce a 32-bit accumulated
result. The x86 backend is currently
not able to optimize it as well as gcc
and icc. The IR we are getting from
the loop vectorizer has several v8i32
adds and muls inside the loop. These
are fed by v8i16 loads and sexts from
v8i16 to v8i32. The x86 backend
recognizes that these are addition
reductions of multiplication so we use
the vpmaddwd instruction which
calculates 32-bit products from 16-bit
inputs and does a horizontal add of
adjacent pairs. A vpmaddwd given two
v8i16 inputs will produce a v4i32
result.</div>
</div>
</blockquote>
</div>
</blockquote>
<div><br>
</div>
<div>That godbolt link seems wrong. It wasn't
supposed to be clang IR. This should be right.</div>
<div> </div>
<blockquote class="gmail_quote"
style="margin:0px 0px 0px
0.8ex;border-left:1px solid
rgb(204,204,204);padding-left:1ex">
<div bgcolor="#FFFFFF">
<blockquote type="cite">
<div dir="ltr">
<div><br>
</div>
<div>In the example code, because we are
reducing the number of elements from
8->4 in the vpmaddwd step we are
left with a width mismatch between
vpmaddwd and the vpaddd instruction
that we use to sum with the results
from the previous loop iterations. We
rely on the fact that a 128-bit
vpmaddwd zeros the upper bits of the
register so that we can use a 256-bit
vpaddd instruction so that the upper
elements can keep going around the
loop without being disturbed in case
they weren't initialized to 0. But
this still means the vpmaddwd
instruction is doing half the amount
of work the CPU is capable of if we
had been able to use a 256-bit
vpmaddwd instruction. Additionally,
future x86 CPUs will be gaining an
instruction that can do VPMADDWD and
VPADDD in one instruction, but that
width mismatch makes that instruction
difficult to utilize.</div>
<div><br>
</div>
<div>In order for the backend to handle
this better it would be great if we
could have something like two v32i8
loads, two shufflevectors to extract
the even elements and the odd elements
to create four v16i8 pieces.</div>
</div>
</blockquote>
<br>
Why v*i8 loads? I thought that we have
16-bit and 32-bit types here?<br>
</div>
</blockquote>
<div><br>
</div>
<div>Oops that should have been v16i16. Mixed up
my 256-bit types.</div>
<div> </div>
<blockquote class="gmail_quote"
style="margin:0px 0px 0px
0.8ex;border-left:1px solid
rgb(204,204,204);padding-left:1ex">
<div bgcolor="#FFFFFF"> <br>
<blockquote type="cite">
<div dir="ltr">
<div>Sign extend each of those pieces.
Multiply the two even pieces and the
two odd pieces separately, sum those
results with a v8i32 add. Then another
v8i32 add to accumulate the previous
loop iterations.</div>
</div>
</blockquote>
</div>
</blockquote>
</div>
</div>
</blockquote>
<br>
I'm still missing something. Why do you want to
separate out the even and odd parts instead of just
adding up the first half of the numbers and the second
half?<br>
</div>
</blockquote>
<div><br>
</div>
<div>Doing even/odd matches up with a pattern I already
have to support for the code in <a
href="https://reviews.llvm.org/D49636" target="_blank"
moz-do-not-send="true">https://reviews.llvm.org/D49636</a>.
I wouldn't even need to detect is as a reduction to do
the reassocation since even/odd exactly matches the
behavior of the instruction. But you're right we could
also just detect the reduction and add two halves.</div>
<div><br>
</div>
<div> </div>
<blockquote class="gmail_quote" style="margin:0px 0px 0px
0.8ex;border-left:1px solid
rgb(204,204,204);padding-left:1ex">
<div bgcolor="#FFFFFF"> <br>
Thanks again,<br>
Hal<br>
<br>
<blockquote type="cite">
<div dir="ltr">
<div class="gmail_quote">
<blockquote class="gmail_quote"
style="margin:0px 0px 0px
0.8ex;border-left:1px solid
rgb(204,204,204);padding-left:1ex">
<div bgcolor="#FFFFFF">
<blockquote type="cite">
<div dir="ltr">
<div> Then ensures that no pieces exceed
the target vector width and the final
operation is correctly sized to go
around the loop in one register. All
but the last add can then be pattern
matched to vpmaddwd as proposed in <a
href="https://reviews.llvm.org/D49636" target="_blank"
moz-do-not-send="true">https://reviews.llvm.org/D49636</a>.
And for the future CPU the whole thing
can be matched to the new instruction.<br>
</div>
<div><br>
</div>
<div>Do other targets have a similar
instruction or a similar issue to
this? Is this something we can solve
in the loop vectorizer? Or should we
have a separate IR transformation that
can recognize this pattern and
generate the new sequence? As a
separate pass we would need to pair
two vector loads together, remove a
reduction step outside the loop and
remove half the phis assuming the loop
was partially unrolled. Or if there
was only one add/mul inside the loop
we'd have to reduce its width and the
width of the phi.</div>
</div>
</blockquote>
<br>
Can you explain how the desired code from
the vectorizer differs from the code that
the vectorizer produces if you add '#pragma
clang loop vectorize(enable)
vectorize_width(16)' above the loop? I
tried it in your godbolt example and the
generated code looks very similar to the
icc-generated code.<br>
</div>
</blockquote>
<div><br>
</div>
<div>It's similar, but the vpxor %xmm0, %xmm0,
%xmm0 is being unnecessarily carried across
the loop. It's then redundantly added twice in
the reduction after the loop despite it being
0. This happens because we basically tricked
the backend into generating a 256-bit vpmaddwd
concated with a 256-bit zero vector going into
a 512-bit vaddd before type legalization. The
512-bit concat and vpaddd get split during
type legalization, and the high half of the
add gets constant folded away. I'm guessing we
probably finished with 4 vpxors before the
loop but MachineCSE(or some other pass?)
combined two of them when it figured out the
loop didn't modify them.</div>
<div> </div>
<blockquote class="gmail_quote"
style="margin:0px 0px 0px
0.8ex;border-left:1px solid
rgb(204,204,204);padding-left:1ex">
<div bgcolor="#FFFFFF"> <br>
Thanks again,<br>
Hal<br>
<br>
<blockquote type="cite">
<div dir="ltr">
<div><br>
</div>
Thanks,<br clear="all">
<div>
<div dir="ltr"
class="m_-1066914453783556908gmail-m_5656040821941643324m_-4456769814224348104m_7042871795306411262gmail-m_-1264842386137689721gmail_signature">~Craig</div>
</div>
</div>
</blockquote>
<br>
<pre class="m_-1066914453783556908gmail-m_5656040821941643324m_-4456769814224348104m_7042871795306411262moz-signature" cols="72">--
Hal Finkel
Lead, Compiler Technology and Programming Languages
Leadership Computing Facility
Argonne National Laboratory</pre>
</div>
</blockquote>
</div>
</div>
</blockquote>
<br>
<pre class="m_-1066914453783556908gmail-m_5656040821941643324moz-signature" cols="72">--
Hal Finkel
Lead, Compiler Technology and Programming Languages
Leadership Computing Facility
Argonne National Laboratory</pre>
</div>
</blockquote>
</div>
</div>
</blockquote>
</div>
</blockquote>
<br>
<pre class="moz-signature" cols="72">--
Hal Finkel
Lead, Compiler Technology and Programming Languages
Leadership Computing Facility
Argonne National Laboratory</pre>
</body>
</html>