<html>
<head>
<meta content="text/html; charset=utf-8" http-equiv="Content-Type">
</head>
<body bgcolor="#FFFFFF" text="#000000">
<p><br>
</p>
<div class="moz-cite-prefix">On 09/04/2017 12:12 AM, Xinliang David
Li wrote:<br>
</div>
<blockquote
cite="mid:CAAkRFZ+o+93NRd=8kGHAbA+VmTpERhWrBRQKAMp075gEh4846g@mail.gmail.com"
type="cite">
<meta http-equiv="Content-Type" content="text/html; charset=utf-8">
<div dir="ltr"><br>
<div class="gmail_extra"><br>
<div class="gmail_quote">On Sun, Sep 3, 2017 at 9:23 PM, Hal
Finkel <span dir="ltr"><<a moz-do-not-send="true"
href="mailto:hfinkel@anl.gov" target="_blank">hfinkel@anl.gov</a>></span>
wrote:<br>
<blockquote class="gmail_quote" style="margin:0 0 0
.8ex;border-left:1px #ccc solid;padding-left:1ex">
<div bgcolor="#FFFFFF" text="#000000"><span class="">
<p><br>
</p>
<div class="m_-5573162088652840693moz-cite-prefix">On
09/03/2017 11:06 PM, Xinliang David Li wrote:<br>
</div>
<blockquote type="cite">
<div dir="ltr">I think we can think this in another
way.
<div><br>
</div>
<div>For modern CPU architectures which supports
store forwarding with store queues, it is
generally not "safe" to blindly do local
optimizations to widen the load/stores</div>
</div>
</blockquote>
<br>
</span> Why not widen stores? Generally the problem with
store forwarding is where the load is wider than the
store (plus alignment issues).<span class=""><br>
<br>
</span></div>
</blockquote>
<div><br>
</div>
              <div>True, but probably with some target-dependent
                caveats. Store widening also requires additional bit
                operations (and possibly an additional load), so it is
                tricky to model the overall benefit. </div>
<div><br>
</div>
<div> </div>
<blockquote class="gmail_quote" style="margin:0 0 0
.8ex;border-left:1px #ccc solid;padding-left:1ex">
<div bgcolor="#FFFFFF" text="#000000"><span class="">
<blockquote type="cite">
<div dir="ltr">
                    <div> without sophisticated inter-procedural
                      analysis. Doing so runs the risk of greatly
                      reducing the performance of some programs.
                      Keeping a naturally aligned load/store at its
                      natural type is safer. </div>
<div><br>
</div>
<div>Does it make sense?</div>
</div>
</blockquote>
<br>
</span> It makes sense. I could, however, say the same
thing about inlining. We need to make inlining decisions
locally, but they have global impact. Nevertheless, we
need to make inlining decisions, and there's no
practical way to make them in a truly non-local way.<br>
</div>
</blockquote>
<div><br>
</div>
<div>Speaking of inlining, we are actually thinking of ways
to make the decisions more globally optimal, but that is
off topic.</div>
</div>
</div>
</div>
</blockquote>
<br>
Neat.<br>
<br>
<blockquote
cite="mid:CAAkRFZ+o+93NRd=8kGHAbA+VmTpERhWrBRQKAMp075gEh4846g@mail.gmail.com"
type="cite">
<div dir="ltr">
<div class="gmail_extra">
<div class="gmail_quote">
<div> </div>
<blockquote class="gmail_quote" style="margin:0 0 0
.8ex;border-left:1px #ccc solid;padding-left:1ex">
<div bgcolor="#FFFFFF" text="#000000"> <br>
We also don't pessimize common cases to improve
performance in rare cases. In the common case, reducing
pressure on the memory units, and reducing the critical
path, seem likely to me to be optimal. If that's not
true, or doing otherwise has negligible cost (but can
prevent rare downsides), we should certainly consider
those options.</div>
</blockquote>
<div><br>
</div>
          <div>Since we don't do load widening for non-bitfield cases
            (only for the very limited case of naturally aligned
            bitfields), it is hard to say we pessimize common cases
            for rare ones:</div>
          <div><br>
          </div>
          <div>1) the upside of widening such accesses is not high,
            judging from experience with another compiler (which does
            not do so);</div>
          <div>2) there is an obvious downside in introducing
            additional extract instructions, which degrades
            performance;</div>
          <div>3) there is an obvious downside of severely degraded
            performance when store forwarding is blocked.</div>
</div>
</div>
</div>
</blockquote>
<br>
    I suspect that it's relatively rare to hit these store-to-load
    forwarding issues compared to the number of times the system
    stores to, or loads from, bitfields. In any case, I did some
    experiments on my Haswell system and found that the version of
    Wei's benchmark in which the load is split into two loads is
    0.012% slower than the single-load version. Indeed, I wouldn't
    worry about that too much. On my P8, I couldn't measure a
    difference. Obviously, this does somewhat miss the point, as the
    real cost of this kind of thing comes from stressing the memory
    units in code with a lot going on, not in isolated cases.<br>
<br>
Nevertheless, I think that you've convinced me that this is a
least-bad solution. I'll want a flag preserving the old behavior.
Something like -fwiden-bitfield-accesses (modulo bikeshedding).<br>
<br>
<blockquote
cite="mid:CAAkRFZ+o+93NRd=8kGHAbA+VmTpERhWrBRQKAMp075gEh4846g@mail.gmail.com"
type="cite">
<div dir="ltr">
<div class="gmail_extra">
<div class="gmail_quote">
<div><br>
</div>
<div><br>
</div>
<blockquote class="gmail_quote" style="margin:0 0 0
.8ex;border-left:1px #ccc solid;padding-left:1ex">
<div bgcolor="#FFFFFF" text="#000000"> <br>
And none of this answers the question of whether it's
better to have the store wider or the load split and
narrow.<br>
</div>
</blockquote>
<div><br>
</div>
<div><br>
</div>
        <div>It seems safer to do store widening more aggressively to
          avoid the store-forwarding stall issue, but doing so
          aggressively may also introduce other runtime overhead (an
          extra load, data merging, etc.).</div>
</div>
</div>
</div>
</blockquote>
<br>
Yes. Wei confirmed that this is slower.<br>
<br>
Thanks again,<br>
Hal<br>
<br>
<blockquote
cite="mid:CAAkRFZ+o+93NRd=8kGHAbA+VmTpERhWrBRQKAMp075gEh4846g@mail.gmail.com"
type="cite">
<div dir="ltr">
<div class="gmail_extra">
<div class="gmail_quote">
<div><br>
</div>
<div>Thanks,</div>
<div><br>
</div>
<div>David</div>
<div> </div>
<blockquote class="gmail_quote" style="margin:0 0 0
.8ex;border-left:1px #ccc solid;padding-left:1ex">
<div bgcolor="#FFFFFF" text="#000000"> <br>
Thanks again,<br>
Hal
<div>
<div class="h5"><br>
<br>
<blockquote type="cite">
<div dir="ltr">
<div><br>
</div>
<div>David</div>
<div><br>
</div>
<div><br>
</div>
</div>
<div class="gmail_extra"><br>
<div class="gmail_quote">On Sun, Sep 3, 2017 at
8:55 PM, Hal Finkel <span dir="ltr"><<a
moz-do-not-send="true"
href="mailto:hfinkel@anl.gov"
target="_blank">hfinkel@anl.gov</a>></span>
wrote:<br>
<blockquote class="gmail_quote"
style="margin:0 0 0 .8ex;border-left:1px
#ccc solid;padding-left:1ex">
<div bgcolor="#FFFFFF" text="#000000"><span>
<p><br>
</p>
<div
class="m_-5573162088652840693m_4449805550791560709moz-cite-prefix">On
09/03/2017 10:38 PM, Xinliang David Li
wrote:<br>
</div>
<blockquote type="cite">
<div dir="ltr">Store forwarding stall
cost is usually much higher compared
with a load hitting L1 cache. For
instance, on Haswell, the latter is
~4 cycles, while the store
forwarding stalls cost about 10
cycles more than a successful store
forwarding, which is roughly 15
cycles. In some scenarios, the store
forwarding stalls can be as high as
50 cycles. See Agner's
documentation. <br>
</div>
</blockquote>
<br>
</span> I understand. As I understand it,
there are two potential ways to fix this
problem:<br>
<br>
1. You can make the store wider (to match
the size of the wide load, thus permitting
forwarding).<br>
2. You can make the load smaller (to
match the size of the small store, thus
permitting forwarding).<br>
<br>
At least in this benchmark, which is a
better solution?<br>
<br>
Thanks again,<br>
Hal
<div>
<div class="m_-5573162088652840693h5"><br>
<br>
<blockquote type="cite">
<div dir="ltr">
<div><br>
</div>
                          <div>In other words, the optimizer needs to
                            be taught to avoid defeating this hardware
                            pipeline feature as much as possible.</div>
<div><br>
</div>
<div>David</div>
</div>
<div class="gmail_extra"><br>
<div class="gmail_quote">On Sun,
Sep 3, 2017 at 6:32 PM, Hal
Finkel <span dir="ltr"><<a
moz-do-not-send="true"
href="mailto:hfinkel@anl.gov"
target="_blank">hfinkel@anl.gov</a>></span>
wrote:<br>
<blockquote class="gmail_quote"
style="margin:0 0 0
.8ex;border-left:1px #ccc
solid;padding-left:1ex">
<div
class="m_-5573162088652840693m_4449805550791560709HOEnZb">
<div
class="m_-5573162088652840693m_4449805550791560709h5"><br>
On 09/03/2017 03:44 PM,
Wei Mi wrote:<br>
<blockquote
class="gmail_quote"
style="margin:0 0 0
.8ex;border-left:1px
#ccc
solid;padding-left:1ex">
On Sat, Sep 2, 2017 at
6:04 PM, Hal Finkel <<a
moz-do-not-send="true"
href="mailto:hfinkel@anl.gov" target="_blank">hfinkel@anl.gov</a>>
wrote:<br>
<blockquote
class="gmail_quote"
style="margin:0 0 0
.8ex;border-left:1px
#ccc
solid;padding-left:1ex">
On 08/22/2017 10:56
PM, Wei Mi via
llvm-commits wrote:<br>
<blockquote
class="gmail_quote"
style="margin:0 0 0
.8ex;border-left:1px
#ccc
solid;padding-left:1ex">
On Tue, Aug 22, 2017
at 7:03 PM, Xinliang
David Li <<a
moz-do-not-send="true"
href="mailto:davidxl@google.com" target="_blank">davidxl@google.com</a>><br>
wrote:<br>
<blockquote
class="gmail_quote"
style="margin:0 0
0
.8ex;border-left:1px
#ccc
solid;padding-left:1ex">
<br>
On Tue, Aug 22,
2017 at 6:37 PM,
Chandler Carruth
via Phabricator<br>
<<a
moz-do-not-send="true"
href="mailto:reviews@reviews.llvm.org" target="_blank">reviews@reviews.llvm.org</a>>
wrote:<br>
<blockquote
class="gmail_quote"
style="margin:0
0 0
.8ex;border-left:1px
#ccc
solid;padding-left:1ex">
                                          chandlerc added a comment.<br>
                                          <br>
                                          I'm really not a fan of the
                                          degree of complexity and
                                          subtlety that this introduces
                                          into the frontend, all to
                                          allow particular backend
                                          optimizations.<br>
                                          <br>
                                          I feel like this is Clang
                                          working around a fundamental
                                          deficiency in LLVM, and we
                                          should instead find a way to
                                          fix this in LLVM itself.<br>
                                          <br>
                                          As has been pointed out
                                          before, user code can
                                          synthesize large integers
                                          that small bit sequences are
                                          extracted from, and Clang and
                                          LLVM should handle those just
                                          as well as actual bitfields.<br>
                                          <br>
                                          Can we see how far we can
                                          push the LLVM side before we
                                          add complexity to Clang here?
                                          I understand that there
                                          remain challenges to LLVM's
                                          stuff, but I don't think
                                          those challenges make *all*
                                          of the LLVM improvements off
                                          the table, I don't think
                                          we've exhausted all ways of
                                          improving the LLVM changes
                                          being proposed, and I think
                                          we should still land all of
                                          those and re-evaluate how
                                          important these issues are
                                          when all of that is in place.<br>
</blockquote>
<br>
                                        The main challenge of doing
                                        this in LLVM is that
                                        inter-procedural (and possibly
                                        cross-module) analysis is
                                        needed (for the
                                        store-forwarding issues).<br>
<br>
                                        Wei, perhaps you can provide a
                                        concrete test case to
                                        illustrate the issue, so that
                                        reviewers have a good
                                        understanding.<br>
<br>
David<br>
</blockquote>
                                      Here is a runnable testcase:<br>
                                      -------------------- 1.cc ------------------------<br>
                                      class A {<br>
                                      public:<br>
                                        unsigned long f1:2;<br>
                                        unsigned long f2:6;<br>
                                        unsigned long f3:8;<br>
                                        unsigned long f4:4;<br>
                                      };<br>
                                      A a;<br>
                                      unsigned long b;<br>
                                      unsigned long N = 1000000000;<br>
                                      <br>
                                      __attribute__((noinline))<br>
                                      void foo() {<br>
                                        a.f3 = 3;<br>
                                      }<br>
                                      <br>
                                      __attribute__((noinline))<br>
                                      void goo() {<br>
                                        b = a.f3;<br>
                                      }<br>
                                      <br>
                                      int main() {<br>
                                        unsigned long i;<br>
                                        for (i = 0; i &lt; N; i++) {<br>
                                          foo();<br>
                                          goo();<br>
                                        }<br>
                                      }<br>
------------------------------<wbr>------------------------------<br>
                                      Now trunk takes about twice the
                                      running time of trunk + this
                                      patch. That is because trunk
                                      shrinks the store of a.f3 in foo
                                      (done by the DAGCombiner) but
                                      does not shrink the load of a.f3
                                      in goo, so store forwarding will
                                      be blocked.<br>
</blockquote>
<br>
                                    I can confirm that this is true on
                                    Haswell and also on a POWER8.
                                    Interestingly, on a POWER7, the
                                    performance is the same either way
                                    (on the good side). I ran the test
                                    case as presented and also with f3
                                    replaced by a non-bitfield
                                    unsigned char member. Thinking
                                    that the POWER7 result might be
                                    because it's big-endian, I flipped
                                    the order of the fields, and found
                                    that the version where f3 is not a
                                    bitfield is faster than otherwise,
                                    but only by 12.5%.<br>
                                    <br>
                                    Why, in this case, don't we shrink
                                    the load? It seems like we should
                                    (and it seems like a
                                    straightforward case).<br>
                                    <br>
                                    Thanks again,<br>
                                    Hal<br>
<br>
</blockquote>
                                  Hal, thanks for trying the test.<br>
                                  <br>
                                  Yes, it is straightforward to shrink
                                  the load in this test. I can change
                                  the testcase a little to show why it
                                  is sometimes difficult to shrink the
                                  load:<br>
<br>
                                  class A {<br>
                                  public:<br>
                                    unsigned long f1:16;<br>
                                    unsigned long f2:16;<br>
                                    unsigned long f3:16;<br>
                                    unsigned long f4:8;<br>
                                  };<br>
                                  A a;<br>
                                  bool b;<br>
                                  unsigned long N = 1000000000;<br>
                                  <br>
                                  __attribute__((noinline))<br>
                                  void foo() {<br>
                                    a.f4 = 3;<br>
                                  }<br>
                                  <br>
                                  __attribute__((noinline))<br>
                                  void goo() {<br>
                                    b = (a.f4 == 0 && a.f3 == (unsigned short)-1);<br>
                                  }<br>
                                  <br>
                                  int main() {<br>
                                    unsigned long i;<br>
                                    for (i = 0; i &lt; N; i++) {<br>
                                      foo();<br>
                                      goo();<br>
                                    }<br>
                                  }<br>
<br>
                                  For the load of a.f4 in goo, it is
                                  difficult to motivate shrinking it
                                  after instcombine, because the
                                  comparison with a.f3 and the
                                  comparison with a.f4 are merged:<br>
<br>
                                  define void @_Z3goov() local_unnamed_addr #0 {<br>
                                    %1 = load i64, i64* bitcast (%class.A* @a to i64*), align 8<br>
                                    %2 = and i64 %1, 0xffffff00000000<br>
                                    %3 = icmp eq i64 %2, 0xffff00000000<br>
                                    %4 = zext i1 %3 to i8<br>
                                    store i8 %4, i8* @b, align 1, !tbaa !2<br>
                                    ret void<br>
                                  }<br>
</blockquote>
<br>
</div>
</div>
                                Exactly. But this behavior is
                                desirable, locally. There's no good
                                answer here: we either generate extra
                                load traffic here (because we need to
                                load the fields separately), or we
                                widen the store (which generates extra
                                load traffic there). Do you know, in
                                terms of performance, which is better
                                in this case (i.e., is it better to
                                widen the store or to split the
                                load)?<span class="m_-5573162088652840693m_4449805550791560709HOEnZb"><font color="#888888"><br>
                                <br>
                                -Hal</font></span>
<div
class="m_-5573162088652840693m_4449805550791560709HOEnZb">
<div
class="m_-5573162088652840693m_4449805550791560709h5"><br>
<br>
<blockquote
class="gmail_quote"
style="margin:0 0 0
.8ex;border-left:1px
#ccc
solid;padding-left:1ex">
<br>
Thanks,<br>
Wei.<br>
<br>
<blockquote
class="gmail_quote"
style="margin:0 0 0
.8ex;border-left:1px
#ccc
solid;padding-left:1ex">
<blockquote
class="gmail_quote"
style="margin:0 0 0
.8ex;border-left:1px
#ccc
solid;padding-left:1ex">
                                    The testcase shows the potential
                                    problem of store shrinking. Before
                                    we decide to do store shrinking,
                                    we need to know that all of the
                                    related loads will be shrunk, and
                                    that requires interprocedural
                                    analysis. Otherwise, when load
                                    shrinking is blocked for some
                                    difficult case (like the
                                    instcombine case described in
                                    <a moz-do-not-send="true" href="https://www.mail-archive.com/cfe-commits@lists.llvm.org/msg65085.html" rel="noreferrer" target="_blank">https://www.mail-archive.com/cfe-commits@lists.llvm.org/msg65085.html</a>),
                                    a performance regression will
                                    happen.<br>
                                    <br>
                                    Wei.<br>
<br>
<br>
<blockquote
class="gmail_quote"
style="margin:0 0
0
.8ex;border-left:1px
#ccc
solid;padding-left:1ex">
<blockquote
class="gmail_quote"
style="margin:0
0 0
.8ex;border-left:1px
#ccc
solid;padding-left:1ex">
<br>
Repository:<br>
rL LLVM<br>
<br>
<a
moz-do-not-send="true"
href="https://reviews.llvm.org/D36562" rel="noreferrer" target="_blank">https://reviews.llvm.org/D3656<wbr>2</a><br>
<br>
<br>
<br>
</blockquote>
</blockquote>
______________________________<wbr>_________________<br>
llvm-commits mailing
list<br>
<a
moz-do-not-send="true"
href="mailto:llvm-commits@lists.llvm.org" target="_blank">llvm-commits@lists.llvm.org</a><br>
<a
moz-do-not-send="true"
href="http://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-commits"
rel="noreferrer"
target="_blank">http://lists.llvm.org/cgi-bin/<wbr>mailman/listinfo/llvm-commits</a><br>
</blockquote>
<br>
--<br>
Hal Finkel<br>
Lead, Compiler
Technology and
Programming Languages<br>
Leadership Computing
Facility<br>
Argonne National
Laboratory<br>
<br>
</blockquote>
</blockquote>
<br>
-- <br>
Hal Finkel<br>
Lead, Compiler Technology
and Programming Languages<br>
Leadership Computing
Facility<br>
Argonne National
Laboratory<br>
<br>
</div>
</div>
</blockquote>
</div>
<br>
</div>
</blockquote>
<br>
<pre class="m_-5573162088652840693m_4449805550791560709moz-signature" cols="72">--
Hal Finkel
Lead, Compiler Technology and Programming Languages
Leadership Computing Facility
Argonne National Laboratory</pre>
</div>
</div>
</div>
</blockquote>
</div>
<br>
</div>
</blockquote>
<br>
<pre class="m_-5573162088652840693moz-signature" cols="72">--
Hal Finkel
Lead, Compiler Technology and Programming Languages
Leadership Computing Facility
Argonne National Laboratory</pre>
</div>
</div>
</div>
</blockquote>
</div>
<br>
</div>
</div>
</blockquote>
<br>
<pre class="moz-signature" cols="72">--
Hal Finkel
Lead, Compiler Technology and Programming Languages
Leadership Computing Facility
Argonne National Laboratory</pre>
</body>
</html>