<div dir="ltr"><div class="gmail_extra"><br><div class="gmail_quote">On Fri, Feb 21, 2014 at 10:45 PM, Hal Finkel <span dir="ltr"><<a href="mailto:hfinkel@anl.gov" target="_blank">hfinkel@anl.gov</a>></span> wrote:<br>

<blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">Chandler pointed out to me last week that recent x86 cores can also benefit from partial unrolling because of how their uop buffers and loop-stream detectors work (both Intel and AMD chips are similar in this regard).</blockquote>

</div><br>I just want to add a specific realization I had when we were discussing this, one that influenced my feeling that we should look into using the partial unroller *in addition* to the loop vectorizer's unrolling.</div>

<div class="gmail_extra"><br></div><div class="gmail_extra">The latter is, rightfully, about widening the loop. It exposes ILP and other benefits. It is *not*, however, suitable for one thing it is currently being used for: unrolling *purely* to hide the branch cost and/or properly fill the LSD or uop cache. For these purposes, restricting the unrolling to that which can be done in an *interleaved* fashion isn't always reasonable. Instead, we should also support doing this through concatenation.</div>
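<div class="gmail_extra"><br></div><div class="gmail_extra">To make the distinction concrete, here is a rough source-level sketch. It is not what either pass actually emits, and the function names are made up for illustration. The first pair shows an interleaved unroll by 2, which relies on the iterations being independent. The second pair shows a concatenated unroll by 2 of a loop with a loop-carried dependence, where interleaving isn't available but we still take roughly half as many backedge branches.</div>

```cpp
#include <cstddef>
#include <vector>

// Baseline: independent iterations, one backedge branch per element.
void scale(std::vector<int>& a, int k) {
  for (std::size_t i = 0; i < a.size(); ++i)
    a[i] = a[i] * k + 1;
}

// Interleaved unroll by 2 (illustrative): the operations of two
// iterations are mixed together. This is only valid because the
// iterations do not depend on each other.
void scale_interleaved(std::vector<int>& a, int k) {
  std::size_t n = a.size(), i = 0;
  for (; i + 1 < n; i += 2) {
    int x0 = a[i];
    int x1 = a[i + 1];
    x0 = x0 * k;
    x1 = x1 * k;
    a[i] = x0 + 1;
    a[i + 1] = x1 + 1;
  }
  for (; i < n; ++i)  // remainder iteration, if any
    a[i] = a[i] * k + 1;
}

// Baseline with a loop-carried dependence: iteration i reads the
// value written by iteration i - 1.
void prefix_sum(std::vector<int>& a) {
  for (std::size_t i = 1; i < a.size(); ++i)
    a[i] += a[i - 1];
}

// Concatenated unroll by 2 (illustrative): the second copy of the body
// simply follows the first, so the dependence order is preserved.
void prefix_sum_concatenated(std::vector<int>& a) {
  std::size_t n = a.size(), i = 1;
  for (; i + 1 < n; i += 2) {
    a[i] += a[i - 1];  // copy 1
    a[i + 1] += a[i];  // copy 2, depends on copy 1
  }
  for (; i < n; ++i)  // remainder iteration, if any
    a[i] += a[i - 1];
}
```

<div class="gmail_extra">A real compiler is free to reschedule either form afterwards; the point is only the ordering each strategy starts from.</div>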

<div class="gmail_extra"><br></div><div class="gmail_extra">My general feeling is that we should essentially use the same size-upper-bound metric in both the vectorizer's unroller and this one, and unroll through interleaving as much as we can (subject to the independence of the iterations), and then continue unrolling with concatenation until we saturate whatever buffer size the target wants.</div>
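<div class="gmail_extra"><br></div><div class="gmail_extra">As a strawman for that policy, consider something like the following. The struct, the function, and all three parameters are hypothetical (nothing here is an existing LLVM interface), and sizes are in whatever unit the target's cost model uses.</div>

```cpp
#include <algorithm>

// Hypothetical result type: how many copies to interleave, and how many
// times to repeat the interleaved body back to back.
struct UnrollPlan {
  unsigned Interleave;
  unsigned Concat;
};

// Hypothetical policy sketch. BodySize is the size of one iteration,
// MaxIndependent is how many iterations may legally be interleaved,
// and SizeBudget is the shared size upper bound for the target buffer.
UnrollPlan planUnroll(unsigned BodySize, unsigned MaxIndependent,
                      unsigned SizeBudget) {
  UnrollPlan P{1, 1};
  if (BodySize == 0 || BodySize > SizeBudget)
    return P;  // nothing sensible to do; leave the loop alone

  // Total number of body copies that fit under the size bound.
  unsigned MaxBySize = SizeBudget / BodySize;

  // Step 1: interleave as much as independence and size allow.
  P.Interleave = std::max(1u, std::min(MaxIndependent, MaxBySize));

  // Step 2: spend the remaining budget on concatenated copies.
  P.Concat = std::max(1u, MaxBySize / P.Interleave);
  return P;
}
```

<div class="gmail_extra">For example, with a body of size 4, at most 2 independent iterations, and a budget of 28, this picks an interleave of 2 and then 3 concatenated copies, for a total size of 24.</div>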

<div class="gmail_extra"><br></div><div class="gmail_extra">Does that make sense to folks?</div></div>