<html>
<head>
<meta content="text/html; charset=windows-1252"
http-equiv="Content-Type">
</head>
<body bgcolor="#FFFFFF" text="#000000">
<div class="moz-cite-prefix">On 16/01/2014, 23:47 , Andrew Trick
wrote:<br>
</div>
<blockquote
cite="mid:5AA9EF9E-1B57-4741-BFD0-A82473A8AF6C@apple.com"
type="cite">
<br>
<div>
<div>On Jan 15, 2014, at 4:13 PM, Diego Novillo &lt;<a
moz-do-not-send="true" href="mailto:dnovillo@google.com">dnovillo@google.com</a>&gt;
wrote:</div>
<br class="Apple-interchange-newline">
<blockquote type="cite">Chandler also pointed me at the vectorizer,
which has its own unroller. However, the vectorizer only unrolls
enough to serve the target, it's not as general as the
runtime-triggered unroller. From what I've seen, it will get a
maximum unroll factor of 2 on x86 (4 on avx targets).
Additionally, the vectorizer only unrolls to aid reduction
variables. When I forced the vectorizer to unroll these loops,
the performance effects were nil.</blockquote>
</div>
<br>
<div>
<div>Vectorization and partial unrolling (aka runtime unrolling)
for ILP should be the same pass. The profitability analysis
required in each case is very closely related, and you never
want to do one before or after the other. The analysis makes
sense even for targets without vector units. The “vector
unroller” has an extra restriction (unlike the LoopUnroll
pass) in that it must be able to interleave operations across
iterations. This is usually a good thing to check before
unrolling, but the compiler’s dependence analysis may be too
conservative in some cases.</div>
</div>
</blockquote>
<br>
Beyond tuning the cost model, I found that the vectorizer does not
even get that far into its analysis for some of the loops that I
need unrolled. In this particular case, three loops need to be
unrolled to get the performance I'm looking for. Of the three, only
one gets far enough in the analysis for the vectorizer to decide
whether to unroll it.<br>
<br>
But I found a bigger issue. The loop optimizers run under the loop
pass manager (I am still trying to wrap my head around that; I find
it very odd and do not yet understand why there is a separate
manager for loops). Inside the loop pass manager, I am not allowed
to call the block frequency analysis. Any attempt I make at
scheduling BF analysis sends the compiler into an infinite loop
during initialization.<br>
<br>
Chandler suggested a way around the problem. I'll work on that
first.<br>
<br>
<blockquote
cite="mid:5AA9EF9E-1B57-4741-BFD0-A82473A8AF6C@apple.com"
type="cite">
<div>
<div>Currently, the cost model is conservative w.r.t. unrolling
because we don't want to increase code size. But minimally, we
should unroll until we can saturate the resources/ports. For
example, a loop with a single load should be unrolled x2 so we
can do two loads per cycle. If you can come up with improved
heuristics without generally impacting code size, that’s great.</div>
</div>
</blockquote>
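To make the saturation argument above concrete, here is a minimal
sketch of the arithmetic. The function name and the unit counts are
assumptions made up for illustration; this is not LLVM's cost model,
and it ignores dependences, register pressure and code size.<br>
<pre>
```python
import math


def min_unroll_factor(ops_per_iteration: int, units_per_cycle: int) -> int:
    """Smallest unroll factor that gives every unit an op each cycle.

    Hypothetical helper: ops_per_iteration is how many operations of one
    kind (e.g. loads) a single loop iteration issues, units_per_cycle is
    how many such operations the target can start per cycle.
    """
    if ops_per_iteration == 0:
        # Nothing of this kind in the loop, so nothing to saturate.
        return 1
    return max(1, math.ceil(units_per_cycle / ops_per_iteration))


# The example from the quote: one load per iteration, two loads per cycle.
print(min_unroll_factor(1, 2))
```
</pre>
A loop that already issues as many loads as the target can start per
cycle gets a factor of 1, so under this rule it would be left alone.<br>
<br>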
Oh, code size will always go up. That's pretty much unavoidable when
you decide to unroll. The trick here is to only unroll select loops.
The profiler does not tell you the trip count. What it will do is
cause the loop header to be excessively heavy wrt its parent in the
block frequency analysis. In this particular case, you get something
like:<br>
<br>
<tt>---- Block Freqs ----</tt><tt><br>
</tt><tt> entry = 1.0</tt><tt><br>
</tt><tt> entry -> if.else = 0.375</tt><tt><br>
</tt><tt> entry -> if.then = 0.625</tt><tt><br>
</tt><tt> if.then = 0.625</tt><tt><br>
</tt><tt> if.then -> if.end3 = 0.625</tt><tt><br>
</tt><tt> if.else = 0.375</tt><tt><br>
</tt><tt> if.else -> for.cond.preheader = 0.37487</tt><tt><br>
</tt><tt> if.else -> if.end3 = 0.00006</tt><tt><br>
</tt><tt> for.cond.preheader = 0.37487</tt><tt><br>
</tt><tt> for.cond.preheader -> for.body.lr.ph = 0.37463</tt><tt><br>
</tt><tt> for.cond.preheader -> for.end = 0.00018</tt><tt><br>
</tt><tt> for.body.lr.ph = 0.37463</tt><tt><br>
</tt><tt> for.body.lr.ph -> for.body = 0.37463</tt><tt><br>
</tt><b><tt> for.body = 682.0</tt></b><b><tt><br>
</tt></b><b><tt> for.body -> for.body = 681.65466</tt></b><tt><br>
</tt><tt> for.body -> for.end = 0.34527</tt><tt><br>
</tt><tt> for.end = 0.34545</tt><tt><br>
</tt><tt> for.end -> if.end3 = 0.34545</tt><tt><br>
</tt><tt> if.end3 = 0.9705<br>
<br>
</tt>Notice how the head of the loop has weight 682, which is 682x
the weight of its parent (the function entry, since this is an
outermost loop).<br>
<br>
With static heuristics, this ratio is significantly lower (about
3x).<br>
<br>
When we see a ratio this large, we can decide to unroll the loop.<br>
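<br>
A rough sketch of that check, written in Python rather than as an
LLVM pass. The dump excerpt is taken from the frequencies listed
earlier; the function names and the cutoff of 100 are assumptions
picked only for illustration, somewhere between the roughly 3x seen
with static heuristics and the 682x seen with the profile.<br>
<pre>
```python
import re

# Excerpt of the block-frequency dump from this message.
DUMP = """
entry = 1.0
if.else = 0.375
for.cond.preheader = 0.37487
for.body.lr.ph = 0.37463
for.body = 682.0
for.body -> for.body = 681.65466
for.end = 0.34545
if.end3 = 0.9705
"""

# Hypothetical cutoff, not a value used by LLVM.
HOT_LOOP_RATIO = 100.0


def parse_block_freqs(dump: str) -> dict[str, float]:
    """Map block names to frequencies, skipping edge lines (a -> b)."""
    freqs = {}
    for line in dump.splitlines():
        match = re.fullmatch(r"\s*(\S+)\s*=\s*([0-9.]+)\s*", line)
        if match:
            freqs[match.group(1)] = float(match.group(2))
    return freqs


def header_to_parent_ratio(freqs, header, parent):
    """How much heavier the loop header is than its parent block."""
    return freqs[header] / freqs[parent]


def should_unroll(freqs, header, parent, threshold=HOT_LOOP_RATIO):
    """Flag the loop when the header outweighs its parent by the cutoff."""
    return header_to_parent_ratio(freqs, header, parent) >= threshold


freqs = parse_block_freqs(DUMP)
ratio = header_to_parent_ratio(freqs, "for.body", "entry")
print(f"for.body is {ratio:.0f}x its parent")
print("unroll:", should_unroll(freqs, "for.body", "entry"))
```
</pre>
With the profile data the loop is flagged; with a header only about
3x heavier than its parent, as under static heuristics, it is not.<br>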
<br>
<br>
Diego.<br>
</body>
</html>