<html>

  <head>

    <meta http-equiv="content-type" content="text/html; charset=UTF-8">

  </head>

  <body text="#000000" bgcolor="#FFFFFF">

    <p>I've recently mentioned in a few places that I'm interested in

      enhancing the loop vectorizer to handle multiple exit loops, and

      have been asked to share plans.  This email is intended to a)

      share my current thinking and b) help spark discussion among

      interested parties.  I do need to warn that my near term plans for

      this have been delayed; I got pulled into an internal project

      instead.</p>

    <p><b>Background</b></p>

    <p>At the moment, our loop vectorizer does not handle any loop with

      more than one exit, or where that sole exit is not from the latch

      block.  Interestingly, it does handle internal predication within

      a single loop iteration.  This results in a slightly odd behavior

      where a loop which can be written with either a continue or a

      break can exhibit wildly different performance depending on that

      choice.  It does hint at a possible direction for implementation

      though, and implies that most of the cost modeling pieces are

      already in place.<br>

    </p>

    <p>The use cases I'm looking at basically fall into two buckets:</p>

    <p>for (int i = 0; i < N; i++) {

      <br>

        if (expr(a[i])) break;

      <br>

        ... other vectorizable stuff ...

      <br>

      }

      <br>

      <br>

      for (int i = 0; i < N; i++) {

      <br>

        if (expr(i)) break;

      <br>

        ... other vectorizable stuff ...

      <br>

      }

      <br>

      <br>

      The former are the actually "interesting" examples.  The later are

      cases where we missed eliminating some check we could have, but

      not-vectorizing creates an unfortunate performance cliff.

      <br>

    </p>

    <p><b>Framing</b></p>

    <p>We have three important sub-cases to handle.  <br>

    </p>

    <p>First, there are all the cases where we could have handled the

      multiple exit loop, but chose not to for implementation reasons. 

      A great example is:</p>

    <p>for (int i = 0; i < N; i++) {<br>

        if (i > M) break;<br>

        a[i] = i;<br>

      }</p>

    <p>In this case, SCEV can happily compute the actual exit bound of

      the loop, and we could use the existing vector-body/scalar

      slow-path structure.  The only change needed would be to exit the

      vector body earlier.  (There are some details here, but it's

      almost as easy as I'm describing if my POC patch isn't missing

      something major.)</p>

    <p>There's a couple other cases like this.  I suspect we can get

      decent mileage out of just generalizing the existing code.  <br>

    </p>

    <p><br>

    </p>

    <p>Second, there are the cases where we actually have to handle

      iterations within the vector-body with predication (or

      speculation).  The good news is that there's already support in

      code for predication, we just need to add another source of

      potential predication.  The bad news is that's a fairly major

      change.  <br>

    </p>

    <p>Our challenge will be finding a runtime check for

      dereferenceability.  Consider the following example:<span

        data-sigil="slippery"><br>

        for (int i = 0; i < N; i++) <br>

          if (a[i] == 0) return false;<br>

        return true;</span></p>

    <p><span data-sigil="slippery">If 'a' is an alloca, or statically

        sized global, we can form a runtime check which ensures 'i' is

        in bounds and a speculative load will not fault.  <br>

      </span></p>

    <p>Here's a nice example we could handle with this approach.  <br>

    </p>

    <p>// Assume a and b are both statically sized.<br>

      <span data-sigil="slippery"><span data-sigil="slippery">for (int i

          = 0; i < N; i++) {<br>

            t = b[i];<br>

            if (t > M) throw();<br>

            sum += a[t];<br>

          }</span></span></p>

    <p><span data-sigil="slippery"><span data-sigil="slippery">(This is

          a classic a[b[i]] pattern, but with range checks added.)</span></span></p>

    <p><span data-sigil="slippery">This is broadly applicable enough to

        be useful, and in practice covers my use cases, but I'm hoping

        others have ideas for extensions here.</span><br>

      <span data-sigil="slippery"><span data-sigil="slippery"></span></span></p>

    Before resorting to that though, we have the potential to rely more

    on speculation safety reasoning.  I have a patch out for review

    currently (<span data-sigil="slippery"><span class="phui-oi-objname">D66688</span>

      <a href="https://reviews.llvm.org/D66688" class="phui-oi-link"

        title="[LoopVectorize] Leverage speculation safety to avoid

        masked.loads">[LoopVectorize] Leverage speculation safety to

        avoid masked.loads</a>) which fell out of some prototyping in

      this area; it benefits existing predication logic, so I separated

      it.  <br>

    </span>

    <p><span data-sigil="slippery">The other major challenge here is

        profitability.  Consider a simple loop such as:</span></p>

    <p><span data-sigil="slippery">// assume a[0:N] is known

        dereferenceable<br>

        for (int i = 0; i < N; i++) <br>

          if (a[i] == 0) return false;<br>

        return true;</span></p>

    <p><span data-sigil="slippery">If N is large, and the array is

        non-zero, then this is profitable to vectorize.  If a[0] == 0,

        then it isn't, regardless of the value of N.  <br>

      </span></p>

    <p><span data-sigil="slippery">Figuring out when to vectorize vs not

        for cases like this will require some thought.  I don't really

        have a good answer for this yet, other than when the profile on

        the early exit tells us it's rarely taken.  <br>

      </span></p>

    <p><span data-sigil="slippery"><br>

      </span></p>

    <p><span data-sigil="slippery">Third, for both of the former cases,

        we need to be able to compute exit values along the early exit

        edges.  We can get a lot of mileage out of simply skipping loops

        with exit values (i.e. lcssa phis), as our loop exit value

        rewriting will tend to eliminate them.  However, we will

        eventually want to handle this case, which will require

        generating some interesting complicated code down the exit path

        to figure out which iteration actually exit.  <br>

      </span></p>

    <p><span data-sigil="slippery">I see two general options here:</span></p>

    <p><span data-sigil="slippery">1) Use the vector-body/scalar-body

        idiom of today, and have the vector body exit with IV = I when

        any iteration in [I, I+VF-1] would exit.  (i.e. roll back)  <br>

      </span></p>

    <p><span data-sigil="slippery">2) Insert dedicated exit blocks which

        recompute exit conditions to determine exact exit value, and

        then let the vector body run all iterations in VF which contain

        the exiting iteration.  (Assumes predicated stores, and that the

        exit blocks skip the scalar loop)  <br>

      </span></p>

    <p><span data-sigil="slippery">I currently don't have a reason to

        strongly prefer one over the other.  (2) is conceptually cleaner

        and the one which keeps coming up in conversation, but (1) may

        be simpler to actually implement.  <br>

      </span></p>

    <p><br>

    </p>

    <p><b>Current Plans</b></p>

    <p>At the current moment, I'm reasonable sure that I'm going to get

      the resources to at least tackle some of the cases where we bail

      out unnecessarily.  This will be a huge practical improvement in

      vectorizing robustness, at (I hope) relatively little

      implementation cost.  <br>

    </p>

    <p>I'm also going to continue playing around with enhancements to

      our current dereferenceability logic.  I see that as being a key

      building block to make any predication based approach practical. 

      <br>

    </p>

    <p>I'm not sure I'm going to get to the predication support.  I'd

      like to, but am not sure my resource constraints allow it.  I'll

      also mention that I'm not at all sure how all of this might fit in

      with the VPLAN work.  I'd really welcome feedback on that; is what

      I'm proposing at all consistent with others plans?</p>

    <p><br>

    </p>

    <p>Philip<br>

    </p>

  </body>

</html>