<div dir="ltr"><div class="gmail_extra"><div class="gmail_quote"><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex"><span class="gmail-"><br>

> b) Remainder loops for hand-vectorized code. These will also not be unrolled - the trip-count is unknown, and doesn't have a known multiple. (We may end up with runtime unrolling and yet another "remainder loop", which doesn't really improve things.) And, of course, it's almost always a bad idea to vectorize these. (The exception may be something like hand-vectorization by 16, with a scalar remainder loop. We may want to vectorize that remainder by 4 and leave a smaller scalar remainder, but that sounds like a very small win.)<br>

<br>

</span>I agree, but I think we're going about this the wrong way. The cost of the branching and runtime checks needs to be factored into the cost model (which will be relevant for low-trip-count loops), and that should naturally prevent this kind of messiness. Just not vectorizing low-trip-count loops is suboptimal because it will miss cases where vectorization is quite profitable.<br>

<div class="gmail-HOEnZb"><div class="gmail-h5"><br></div></div></blockquote><div><br></div><div>You're completely right, but this isn't new - it's just that it's being applied non-uniformly, depending on what exactly we know about the trip count. That is, we do it "the wrong way" for loops with a known exact trip count, and don't do it at all for loops with a known upper bound.</div><div><br></div><div>I want us to start treating all three cases (static exact, static bound, dynamic) in the same way, by using the "right" number for the trip-count. Using this number in a smarter way (by estimating the overhead cost, and then dividing it by the trip-count to get the per-iteration cost*) is, I think, orthogonal to actually getting the number right.</div><div><br></div><div>* Well, almost. For a loop with 7 iterations that we vectorize by 4, we aren't really spreading the cost among "1.75" vectorized iterations, but just the one. This is negligible for high trip counts, but the whole point is to evaluate it correctly for the low-trip-count case.</div></div></div></div>
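<div>To make the footnote concrete, here is a minimal sketch of the two ways of amortizing the overhead. The function names and the abstract cost unit are made up for illustration; this is not the vectorizer's actual cost-model code, just the arithmetic under the assumption that the overhead is a single fixed cost paid once per loop execution.</div>

```cpp
#include <cstdint>
#include <limits>

// Hypothetical helpers for illustration only; not part of any LLVM API.
// "overhead" is the fixed cost of the runtime checks and extra branches,
// in whatever abstract units the cost model uses.

// Naive amortization: spread the overhead over tripCount / vf treated as a
// real number (e.g. 7 / 4 = 1.75 "vector iterations").
double naiveOverheadPerVectorIter(double overhead, uint64_t tripCount,
                                  uint64_t vf) {
  return overhead / (static_cast<double>(tripCount) / static_cast<double>(vf));
}

// Amortization over the vector iterations that actually execute: only
// floor(tripCount / vf) of them run, the rest goes to the scalar remainder.
double exactOverheadPerVectorIter(double overhead, uint64_t tripCount,
                                  uint64_t vf) {
  uint64_t vectorIters = tripCount / vf; // integer division rounds down
  if (vectorIters == 0)
    // The vector body never runs, so the overhead is never amortized.
    return std::numeric_limits<double>::infinity();
  return overhead / static_cast<double>(vectorIters);
}
```

<div>With an overhead of 8 units, a trip count of 7 and a vectorization factor of 4, the naive version spreads the cost over 1.75 iterations while the second version charges all 8 units to the single vector iteration. At a trip count of 1000 the two agree, which is the "negligible for high trip counts" point.</div>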