<html>

  <head>

    <meta http-equiv="Content-Type" content="text/html; charset=UTF-8">

  </head>

  <body>

    <p>Small progress update.</p>

    <p>Work on this largely stalled not long after I sent my last email

      due to a difficult to debug issue seen on PPC builders.  That was

      only recently resolved in e49d65f3.  As of now, the last patch for

      the analyzeable exit subset is now on review

      (<a class="moz-txt-link-freetext" href="https://reviews.llvm.org/D105817">https://reviews.llvm.org/D105817</a>).   <br>

    </p>

    <p>Once that goes in, I don't plan to take this any further at this

      time.  This was a hobby project for me, and has taken much longer

      than I anticipated.  Unless I find someone to sponsor this work,

      I'll probably turn my personal hobby efforts towards easier and

      more immediately rewarding efforts.</p>

    <p>Philip</p>

    <div class="moz-cite-prefix">On 1/11/21 12:30 PM, Philip Reames via

      llvm-dev wrote:<br>

    </div>

    <blockquote type="cite"

      cite="mid:0cbdc208-6ccf-2ba6-9481-f0e540812005@philipreames.com">

      <meta http-equiv="Content-Type" content="text/html; charset=UTF-8">

      <p>Responding to this old thread to let interested parties know

        there's been some progress on this (finally).  <br>

      </p>

      <p>The first sub-item described below - multiple exit loops with

        computable trip counts - is in progress, and will likely be

        wrapped up in the not too distant future.  The first major

        change (4b33b2387) landed two weeks ago, two smaller changes are

        on review (<span class="phui-oi-objname"

          data-sigil="ungrabbable">D93725, </span><span

          class="phui-oi-objname" data-sigil="ungrabbable"><span

            class="phui-oi-objname" data-sigil="ungrabbable">and

            D93865), and there's likely only one major patch needed

            after that.</span></span></p>

      <p><span class="phui-oi-objname" data-sigil="ungrabbable"><span

            class="phui-oi-objname" data-sigil="ungrabbable">To my

            knowledge, there's been no progress on the second item and

            I'm not anticipating any in the near future.  <br>

          </span></span></p>

      <p><span class="phui-oi-objname" data-sigil="ungrabbable"><span

            class="phui-oi-objname" data-sigil="ungrabbable">Philip<br>

          </span></span></p>

      <div class="moz-cite-prefix">On 9/9/19 10:53 AM, Philip Reames

        wrote:<br>

      </div>

      <blockquote type="cite"

        cite="mid:a48a26de-46af-3814-3967-c395e8230e51@philipreames.com">

        <meta http-equiv="content-type" content="text/html;

          charset=UTF-8">

        <p>I've recently mentioned in a few places that I'm interested

          in enhancing the loop vectorizer to handle multiple exit

          loops, and have been asked to share plans.  This email is

          intended to a) share my current thinking and b) help spark

          discussion among interested parties.  I do need to warn that

          my near term plans for this have been delayed; I got pulled

          into an internal project instead.</p>

        <p><b>Background</b></p>

        <p>At the moment, our loop vectorizer does not handle any loop

          with more than one exit, or where that sole exit is not from

          the latch block.  Interestingly, it does handle internal

          predication within a single loop iteration.  This results in a

          slightly odd behavior where a loop which can be written with

          either a continue or a break can exhibit wildly different

          performance depending on that choice.  It does hint at a

          possible direction for implementation though, and implies that

          most of the cost modeling pieces are already in place.<br>

        </p>

        <p>The use cases I'm looking at basically fall into two buckets:</p>

        <p>for (int i = 0; i < N; i++) { <br>

            if (expr(a[i])) break; <br>

            ... other vectorizable stuff ... <br>

          } <br>

          <br>

          for (int i = 0; i < N; i++) { <br>

            if (expr(i)) break; <br>

            ... other vectorizable stuff ... <br>

          } <br>

          <br>

          The former are the actually "interesting" examples.  The later

          are cases where we missed eliminating some check we could

          have, but not-vectorizing creates an unfortunate performance

          cliff. <br>

        </p>

        <p><b>Framing</b></p>

        <p>We have three important sub-cases to handle.  <br>

        </p>

        <p>First, there are all the cases where we could have handled

          the multiple exit loop, but chose not to for implementation

          reasons.  A great example is:</p>

        <p>for (int i = 0; i < N; i++) {<br>

            if (i > M) break;<br>

            a[i] = i;<br>

          }</p>

        <p>In this case, SCEV can happily compute the actual exit bound

          of the loop, and we could use the existing vector-body/scalar

          slow-path structure.  The only change needed would be to exit

          the vector body earlier.  (There are some details here, but

          it's almost as easy as I'm describing if my POC patch isn't

          missing something major.)</p>

        <p>There's a couple other cases like this.  I suspect we can get

          decent mileage out of just generalizing the existing code.  <br>

        </p>

        <p><br>

        </p>

        <p>Second, there are the cases where we actually have to handle

          iterations within the vector-body with predication (or

          speculation).  The good news is that there's already support

          in code for predication, we just need to add another source of

          potential predication.  The bad news is that's a fairly major

          change.  <br>

        </p>

        <p>Our challenge will be finding a runtime check for

          dereferenceability.  Consider the following example:<span

            data-sigil="slippery"><br>

            for (int i = 0; i < N; i++) <br>

              if (a[i] == 0) return false;<br>

            return true;</span></p>

        <p><span data-sigil="slippery">If 'a' is an alloca, or

            statically sized global, we can form a runtime check which

            ensures 'i' is in bounds and a speculative load will not

            fault.  <br>

          </span></p>

        <p>Here's a nice example we could handle with this approach.  <br>

        </p>

        <p>// Assume a and b are both statically sized.<br>

          <span data-sigil="slippery"><span data-sigil="slippery">for

              (int i = 0; i < N; i++) {<br>

                t = b[i];<br>

                if (t > M) throw();<br>

                sum += a[t];<br>

              }</span></span></p>

        <p><span data-sigil="slippery"><span data-sigil="slippery">(This

              is a classic a[b[i]] pattern, but with range checks

              added.)</span></span></p>

        <p><span data-sigil="slippery">This is broadly applicable enough

            to be useful, and in practice covers my use cases, but I'm

            hoping others have ideas for extensions here.</span><br>

          <span data-sigil="slippery"><span data-sigil="slippery"></span></span></p>

        Before resorting to that though, we have the potential to rely

        more on speculation safety reasoning.  I have a patch out for

        review currently (<span data-sigil="slippery"><span

            class="phui-oi-objname">D66688</span> <a

            href="https://reviews.llvm.org/D66688" class="phui-oi-link"

            title="[LoopVectorize] Leverage speculation safety to avoid

            masked.loads" moz-do-not-send="true">[LoopVectorize]

            Leverage speculation safety to avoid masked.loads</a>) which

          fell out of some prototyping in this area; it benefits

          existing predication logic, so I separated it.  <br>

        </span>

        <p><span data-sigil="slippery">The other major challenge here is

            profitability.  Consider a simple loop such as:</span></p>

        <p><span data-sigil="slippery">// assume a[0:N] is known

            dereferenceable<br>

            for (int i = 0; i < N; i++) <br>

              if (a[i] == 0) return false;<br>

            return true;</span></p>

        <p><span data-sigil="slippery">If N is large, and the array is

            non-zero, then this is profitable to vectorize.  If a[0] ==

            0, then it isn't, regardless of the value of N.  <br>

          </span></p>

        <p><span data-sigil="slippery">Figuring out when to vectorize vs

            not for cases like this will require some thought.  I don't

            really have a good answer for this yet, other than when the

            profile on the early exit tells us it's rarely taken.  <br>

          </span></p>

        <p><span data-sigil="slippery"><br>

          </span></p>

        <p><span data-sigil="slippery">Third, for both of the former

            cases, we need to be able to compute exit values along the

            early exit edges.  We can get a lot of mileage out of simply

            skipping loops with exit values (i.e. lcssa phis), as our

            loop exit value rewriting will tend to eliminate them. 

            However, we will eventually want to handle this case, which

            will require generating some interesting complicated code

            down the exit path to figure out which iteration actually

            exit.  <br>

          </span></p>

        <p><span data-sigil="slippery">I see two general options here:</span></p>

        <p><span data-sigil="slippery">1) Use the

            vector-body/scalar-body idiom of today, and have the vector

            body exit with IV = I when any iteration in [I, I+VF-1]

            would exit.  (i.e. roll back)  <br>

          </span></p>

        <p><span data-sigil="slippery">2) Insert dedicated exit blocks

            which recompute exit conditions to determine exact exit

            value, and then let the vector body run all iterations in VF

            which contain the exiting iteration.  (Assumes predicated

            stores, and that the exit blocks skip the scalar loop)  <br>

          </span></p>

        <p><span data-sigil="slippery">I currently don't have a reason

            to strongly prefer one over the other.  (2) is conceptually

            cleaner and the one which keeps coming up in conversation,

            but (1) may be simpler to actually implement.  <br>

          </span></p>

        <p><br>

        </p>

        <p><b>Current Plans</b></p>

        <p>At the current moment, I'm reasonable sure that I'm going to

          get the resources to at least tackle some of the cases where

          we bail out unnecessarily.  This will be a huge practical

          improvement in vectorizing robustness, at (I hope) relatively

          little implementation cost.  <br>

        </p>

        <p>I'm also going to continue playing around with enhancements

          to our current dereferenceability logic.  I see that as being

          a key building block to make any predication based approach

          practical.  <br>

        </p>

        <p>I'm not sure I'm going to get to the predication support. 

          I'd like to, but am not sure my resource constraints allow

          it.  I'll also mention that I'm not at all sure how all of

          this might fit in with the VPLAN work.  I'd really welcome

          feedback on that; is what I'm proposing at all consistent with

          others plans?</p>

        <p><br>

        </p>

        <p>Philip<br>

        </p>

      </blockquote>

      <br>

      <fieldset class="mimeAttachmentHeader"></fieldset>

      <pre class="moz-quote-pre" wrap="">_______________________________________________

LLVM Developers mailing list

<a class="moz-txt-link-abbreviated" href="mailto:llvm-dev@lists.llvm.org">llvm-dev@lists.llvm.org</a>

<a class="moz-txt-link-freetext" href="https://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev">https://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev</a>

</pre>

    </blockquote>

  </body>

</html>