<html>

  <head>

    <meta http-equiv="Content-Type" content="text/html; charset=UTF-8">

  </head>

  <body>

    <p>This project was finally completed exactly one month ago with

      change 95346ba87.  Support for multiple exit loop vectorization

      has now been in tree without reported problems long enough to be

      considered complete.  <br>

    </p>

    <p>If anyone is interested, I <a moz-do-not-send="true"

href="https://github.com/preames/public-notes/blob/master/multiple-exit-vectorization.rst">wrote

        up a summary</a> of the work and some thoughts on open problems

      in this space.  I'm not currently planning on continuing work

      here, and this writeup is me dumping my mental state for potential

      latter reconstruction if needed more than anything else, but

      others might find it interesting as well.</p>

    <p>Philip<br>

    </p>

    <div class="moz-cite-prefix">On 7/12/21 8:06 AM, Philip Reames

      wrote:<br>

    </div>

    <blockquote type="cite"

      cite="mid:d50ee6d8-49df-2622-a26b-8c409b8e35e6@philipreames.com">

      <meta http-equiv="Content-Type" content="text/html; charset=UTF-8">

      <p>Small progress update.</p>

      <p>Work on this largely stalled not long after I sent my last

        email due to a difficult to debug issue seen on PPC builders. 

        That was only recently resolved in e49d65f3.  As of now, the

        last patch for the analyzeable exit subset is now on review (<a

          class="moz-txt-link-freetext"

          href="https://reviews.llvm.org/D105817" moz-do-not-send="true">https://reviews.llvm.org/D105817</a>).  

        <br>

      </p>

      <p>Once that goes in, I don't plan to take this any further at

        this time.  This was a hobby project for me, and has taken much

        longer than I anticipated.  Unless I find someone to sponsor

        this work, I'll probably turn my personal hobby efforts towards

        easier and more immediately rewarding efforts.</p>

      <p>Philip</p>

      <div class="moz-cite-prefix">On 1/11/21 12:30 PM, Philip Reames

        via llvm-dev wrote:<br>

      </div>

      <blockquote type="cite"

        cite="mid:0cbdc208-6ccf-2ba6-9481-f0e540812005@philipreames.com">

        <meta http-equiv="Content-Type" content="text/html;

          charset=UTF-8">

        <p>Responding to this old thread to let interested parties know

          there's been some progress on this (finally).  <br>

        </p>

        <p>The first sub-item described below - multiple exit loops with

          computable trip counts - is in progress, and will likely be

          wrapped up in the not too distant future.  The first major

          change (4b33b2387) landed two weeks ago, two smaller changes

          are on review (<span class="phui-oi-objname"

            data-sigil="ungrabbable">D93725, </span><span

            class="phui-oi-objname" data-sigil="ungrabbable"><span

              class="phui-oi-objname" data-sigil="ungrabbable">and

              D93865), and there's likely only one major patch needed

              after that.</span></span></p>

        <p><span class="phui-oi-objname" data-sigil="ungrabbable"><span

              class="phui-oi-objname" data-sigil="ungrabbable">To my

              knowledge, there's been no progress on the second item and

              I'm not anticipating any in the near future.  <br>

            </span></span></p>

        <p><span class="phui-oi-objname" data-sigil="ungrabbable"><span

              class="phui-oi-objname" data-sigil="ungrabbable">Philip<br>

            </span></span></p>

        <div class="moz-cite-prefix">On 9/9/19 10:53 AM, Philip Reames

          wrote:<br>

        </div>

        <blockquote type="cite"

          cite="mid:a48a26de-46af-3814-3967-c395e8230e51@philipreames.com">

          <meta http-equiv="content-type" content="text/html;

            charset=UTF-8">

          <p>I've recently mentioned in a few places that I'm interested

            in enhancing the loop vectorizer to handle multiple exit

            loops, and have been asked to share plans.  This email is

            intended to a) share my current thinking and b) help spark

            discussion among interested parties.  I do need to warn that

            my near term plans for this have been delayed; I got pulled

            into an internal project instead.</p>

          <p><b>Background</b></p>

          <p>At the moment, our loop vectorizer does not handle any loop

            with more than one exit, or where that sole exit is not from

            the latch block.  Interestingly, it does handle internal

            predication within a single loop iteration.  This results in

            a slightly odd behavior where a loop which can be written

            with either a continue or a break can exhibit wildly

            different performance depending on that choice.  It does

            hint at a possible direction for implementation though, and

            implies that most of the cost modeling pieces are already in

            place.<br>

          </p>

          <p>The use cases I'm looking at basically fall into two

            buckets:</p>

          <p>for (int i = 0; i < N; i++) { <br>

              if (expr(a[i])) break; <br>

              ... other vectorizable stuff ... <br>

            } <br>

            <br>

            for (int i = 0; i < N; i++) { <br>

              if (expr(i)) break; <br>

              ... other vectorizable stuff ... <br>

            } <br>

            <br>

            The former are the actually "interesting" examples.  The

            later are cases where we missed eliminating some check we

            could have, but not-vectorizing creates an unfortunate

            performance cliff. <br>

          </p>

          <p><b>Framing</b></p>

          <p>We have three important sub-cases to handle.  <br>

          </p>

          <p>First, there are all the cases where we could have handled

            the multiple exit loop, but chose not to for implementation

            reasons.  A great example is:</p>

          <p>for (int i = 0; i < N; i++) {<br>

              if (i > M) break;<br>

              a[i] = i;<br>

            }</p>

          <p>In this case, SCEV can happily compute the actual exit

            bound of the loop, and we could use the existing

            vector-body/scalar slow-path structure.  The only change

            needed would be to exit the vector body earlier.  (There are

            some details here, but it's almost as easy as I'm describing

            if my POC patch isn't missing something major.)</p>

          <p>There's a couple other cases like this.  I suspect we can

            get decent mileage out of just generalizing the existing

            code.  <br>

          </p>

          <p><br>

          </p>

          <p>Second, there are the cases where we actually have to

            handle iterations within the vector-body with predication

            (or speculation).  The good news is that there's already

            support in code for predication, we just need to add another

            source of potential predication.  The bad news is that's a

            fairly major change.  <br>

          </p>

          <p>Our challenge will be finding a runtime check for

            dereferenceability.  Consider the following example:<span

              data-sigil="slippery"><br>

              for (int i = 0; i < N; i++) <br>

                if (a[i] == 0) return false;<br>

              return true;</span></p>

          <p><span data-sigil="slippery">If 'a' is an alloca, or

              statically sized global, we can form a runtime check which

              ensures 'i' is in bounds and a speculative load will not

              fault.  <br>

            </span></p>

          <p>Here's a nice example we could handle with this approach. 

            <br>

          </p>

          <p>// Assume a and b are both statically sized.<br>

            <span data-sigil="slippery"><span data-sigil="slippery">for

                (int i = 0; i < N; i++) {<br>

                  t = b[i];<br>

                  if (t > M) throw();<br>

                  sum += a[t];<br>

                }</span></span></p>

          <p><span data-sigil="slippery"><span data-sigil="slippery">(This

                is a classic a[b[i]] pattern, but with range checks

                added.)</span></span></p>

          <p><span data-sigil="slippery">This is broadly applicable

              enough to be useful, and in practice covers my use cases,

              but I'm hoping others have ideas for extensions here.</span><br>

            <span data-sigil="slippery"><span data-sigil="slippery"></span></span></p>

          Before resorting to that though, we have the potential to rely

          more on speculation safety reasoning.  I have a patch out for

          review currently (<span data-sigil="slippery"><span

              class="phui-oi-objname">D66688</span> <a

              href="https://reviews.llvm.org/D66688"

              class="phui-oi-link" title="[LoopVectorize] Leverage

              speculation safety to avoid masked.loads"

              moz-do-not-send="true">[LoopVectorize] Leverage

              speculation safety to avoid masked.loads</a>) which fell

            out of some prototyping in this area; it benefits existing

            predication logic, so I separated it.  <br>

          </span>

          <p><span data-sigil="slippery">The other major challenge here

              is profitability.  Consider a simple loop such as:</span></p>

          <p><span data-sigil="slippery">// assume a[0:N] is known

              dereferenceable<br>

              for (int i = 0; i < N; i++) <br>

                if (a[i] == 0) return false;<br>

              return true;</span></p>

          <p><span data-sigil="slippery">If N is large, and the array is

              non-zero, then this is profitable to vectorize.  If a[0]

              == 0, then it isn't, regardless of the value of N.  <br>

            </span></p>

          <p><span data-sigil="slippery">Figuring out when to vectorize

              vs not for cases like this will require some thought.  I

              don't really have a good answer for this yet, other than

              when the profile on the early exit tells us it's rarely

              taken.  <br>

            </span></p>

          <p><span data-sigil="slippery"><br>

            </span></p>

          <p><span data-sigil="slippery">Third, for both of the former

              cases, we need to be able to compute exit values along the

              early exit edges.  We can get a lot of mileage out of

              simply skipping loops with exit values (i.e. lcssa phis),

              as our loop exit value rewriting will tend to eliminate

              them.  However, we will eventually want to handle this

              case, which will require generating some interesting

              complicated code down the exit path to figure out which

              iteration actually exit.  <br>

            </span></p>

          <p><span data-sigil="slippery">I see two general options here:</span></p>

          <p><span data-sigil="slippery">1) Use the

              vector-body/scalar-body idiom of today, and have the

              vector body exit with IV = I when any iteration in [I,

              I+VF-1] would exit.  (i.e. roll back)  <br>

            </span></p>

          <p><span data-sigil="slippery">2) Insert dedicated exit blocks

              which recompute exit conditions to determine exact exit

              value, and then let the vector body run all iterations in

              VF which contain the exiting iteration.  (Assumes

              predicated stores, and that the exit blocks skip the

              scalar loop)  <br>

            </span></p>

          <p><span data-sigil="slippery">I currently don't have a reason

              to strongly prefer one over the other.  (2) is

              conceptually cleaner and the one which keeps coming up in

              conversation, but (1) may be simpler to actually

              implement.  <br>

            </span></p>

          <p><br>

          </p>

          <p><b>Current Plans</b></p>

          <p>At the current moment, I'm reasonable sure that I'm going

            to get the resources to at least tackle some of the cases

            where we bail out unnecessarily.  This will be a huge

            practical improvement in vectorizing robustness, at (I hope)

            relatively little implementation cost.  <br>

          </p>

          <p>I'm also going to continue playing around with enhancements

            to our current dereferenceability logic.  I see that as

            being a key building block to make any predication based

            approach practical.  <br>

          </p>

          <p>I'm not sure I'm going to get to the predication support. 

            I'd like to, but am not sure my resource constraints allow

            it.  I'll also mention that I'm not at all sure how all of

            this might fit in with the VPLAN work.  I'd really welcome

            feedback on that; is what I'm proposing at all consistent

            with others plans?</p>

          <p><br>

          </p>

          <p>Philip<br>

          </p>

        </blockquote>

        <br>

        <fieldset class="mimeAttachmentHeader"></fieldset>

        <pre class="moz-quote-pre" wrap="">_______________________________________________

LLVM Developers mailing list

<a class="moz-txt-link-abbreviated" href="mailto:llvm-dev@lists.llvm.org" moz-do-not-send="true">llvm-dev@lists.llvm.org</a>

<a class="moz-txt-link-freetext" href="https://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev" moz-do-not-send="true">https://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev</a>

</pre>

      </blockquote>

    </blockquote>

  </body>

</html>