<div dir="ltr"><div class="gmail_quote"><div dir="ltr">On Fri, Sep 14, 2018 at 4:39 PM Philip Reames via llvm-dev <<a href="mailto:llvm-dev@lists.llvm.org">llvm-dev@lists.llvm.org</a>> wrote:<br></div><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">
  
    
  
  <div text="#000000" bgcolor="#FFFFFF">
    I agree this approach is possible.  I even have an (ancient,
    abandoned) patch which started down that path. 
    <a class="m_3669519156898967894moz-txt-link-freetext" href="https://reviews.llvm.org/D7061" target="_blank">https://reviews.llvm.org/D7061</a><br>
    <br>
    I view this as being also useful, not a reason not to do the
    transform in LICM.  We've already spent all the analysis cost, we
    might as well use it.  <br>
    <br>
    (Lest it's not clear, I'm assuming that if we did do a separate
    LoopPRE pass that we'd have to separate out a AliasSetTrackAnalysis
    and handle invalidation properly though the loop pass pipeline. 
    i.e. not recompute the AST)<br></div></blockquote><div><br></div><div>I was suggesting dropping the use of the AST entirely :). Not going to happen tomorrow, but at some point...</div><div> </div><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"><div text="#000000" bgcolor="#FFFFFF">
    <br>
    <div class="m_3669519156898967894moz-cite-prefix">On 09/14/2018 04:56 AM, John Brawn
      wrote:<br>
    </div>
    <blockquote type="cite">
      
      
      
      <div class="m_3669519156898967894WordSection1">
        <p class="MsoNormal"><span style="font-size:11.0pt;font-family:"Calibri","sans-serif";color:#1f497d">For
            doing PRE on the load, it looks like there’s only two things
            stopping GVN PRE from doing it:<u></u><u></u></span></p>
        <p class="MsoNormal"><span style="font-size:11.0pt;font-family:"Calibri","sans-serif";color:#1f497d">*
            GVN::PerformLoadPRE doesn’t hoist loads that are
            conditional. Probably this can be overcome with some kind of<u></u><u></u></span></p>
        <p class="MsoNormal"><span style="font-size:11.0pt;font-family:"Calibri","sans-serif";color:#1f497d">   
             heuristic that allows it to happen in loops when the blocks
            where a load would have to be inserted are outside<u></u><u></u></span></p>
        <p class="MsoNormal"><span style="font-size:11.0pt;font-family:"Calibri","sans-serif";color:#1f497d">    
            the loop.<u></u><u></u></span></p>
        <p class="MsoNormal"><span style="font-size:11.0pt;font-family:"Calibri","sans-serif";color:#1f497d">*
            IsFullyAvailableInBlock goes around the loop and
            double-counts the entry-edge into the loop as unavailable.
            I’m<u></u><u></u></span></p>
        <p class="MsoNormal"><span style="font-size:11.0pt;font-family:"Calibri","sans-serif";color:#1f497d">    
            pretty sure this can be easily fixed by marking the block
            that PerformLoadPRE calls IsFullyAvailableInBlock on as<u></u><u></u></span></p>
        <p class="MsoNormal"><span style="font-size:11.0pt;font-family:"Calibri","sans-serif";color:#1f497d">    
            FullyAvailable, so it stops there when following the
            back-edge.<u></u><u></u></span></p>
        <p class="MsoNormal"><span style="font-size:11.0pt;font-family:"Calibri","sans-serif";color:#1f497d">I
            did a quick experiment (solving the first problem by just
            commenting out that check) and this worked for a simple<u></u><u></u></span></p>
        <p class="MsoNormal"><span style="font-size:11.0pt;font-family:"Calibri","sans-serif";color:#1f497d">test
            case of<u></u><u></u></span></p>
        <p class="MsoNormal" style="text-autospace:none"><span>int x;<u></u><u></u></span></p>
        <p class="MsoNormal" style="text-autospace:none"><span>int arr[32];<u></u><u></u></span></p>
        <p class="MsoNormal" style="text-autospace:none"><span>void fn(int n) {<u></u><u></u></span></p>
        <p class="MsoNormal" style="text-autospace:none"><span>   for (int i = 0; i
            < n; i++) {<u></u><u></u></span></p>
        <p class="MsoNormal" style="text-autospace:none"><span>     if (arr[i]) {<u></u><u></u></span></p>
        <p class="MsoNormal" style="text-autospace:none"><span>       x++;<u></u><u></u></span></p>
        <p class="MsoNormal" style="text-autospace:none"><span>     }<u></u><u></u></span></p>
        <p class="MsoNormal" style="text-autospace:none"><span>   }<u></u><u></u></span></p>
        <p class="MsoNormal"><span>}</span><span style="font-size:11.0pt;font-family:"Calibri","sans-serif";color:#1f497d"><u></u><u></u></span></p>
        <p class="MsoNormal"><span style="font-size:11.0pt;font-family:"Calibri","sans-serif";color:#1f497d">So
            perhaps the load PRE half can be done without a separate
            pass.<u></u><u></u></span></p>
        <p class="MsoNormal"><span style="font-size:11.0pt;font-family:"Calibri","sans-serif";color:#1f497d"><u></u> <u></u></span></p>
        <p class="MsoNormal"><span style="font-size:11.0pt;font-family:"Calibri","sans-serif";color:#1f497d">John<u></u><u></u></span></p>
        <p class="MsoNormal"><span style="font-size:11.0pt;font-family:"Calibri","sans-serif";color:#1f497d"><u></u> <u></u></span></p>
        <div style="border:none;border-left:solid blue 1.5pt;padding:0cm 0cm 0cm 4.0pt">
          <div>
            <div style="border:none;border-top:solid #b5c4df 1.0pt;padding:3.0pt 0cm 0cm 0cm">
              <p class="MsoNormal"><b><span style="font-size:10.0pt;font-family:"Tahoma","sans-serif"" lang="EN-US">From:</span></b><span style="font-size:10.0pt;font-family:"Tahoma","sans-serif"" lang="EN-US"> llvm-dev
                  [<a class="m_3669519156898967894moz-txt-link-freetext" href="mailto:llvm-dev-bounces@lists.llvm.org" target="_blank">mailto:llvm-dev-bounces@lists.llvm.org</a>]
                  <b>On Behalf Of </b>Alina Sbirlea via llvm-dev<br>
                  <b>Sent:</b> 13 September 2018 22:01<br>
                  <b>To:</b> <a class="m_3669519156898967894moz-txt-link-abbreviated" href="mailto:listmail@philipreames.com" target="_blank">listmail@philipreames.com</a><br>
                  <b>Cc:</b> <a class="m_3669519156898967894moz-txt-link-abbreviated" href="mailto:llvm-dev@lists.llvm.org" target="_blank">llvm-dev@lists.llvm.org</a><br>
                  <b>Subject:</b> Re: [llvm-dev] Generalizing load/store
                  promotion in LICM<u></u><u></u></span></p>
            </div>
          </div>
          <p class="MsoNormal"><u></u> <u></u></p>
          <div>
            <p class="MsoNormal">For context, I've been looking into
              replacing the use of AST (AliasSetTracker) with MemorySSA
              in LICM. This works great for sinking/hoisting but does
              not apply well for promotion, and one of the solutions I
              considered is precisely ripping out the promotion part and
              replacing its functionality with a separate PRE pass +
              possibly store sinking. FWIW I think that's the right way
              to go.<u></u><u></u></p>
            <div>
              <p class="MsoNormal">I did not get deep enough into
                working on the solution but I would gladly have a
                detailed discussion to move this forward.<u></u><u></u></p>
            </div>
            <div>
              <p class="MsoNormal"><u></u> <u></u></p>
            </div>
            <div>
              <p class="MsoNormal">Reading into detail now.<u></u><u></u></p>
            </div>
            <div>
              <p class="MsoNormal"><u></u> <u></u></p>
            </div>
            <div>
              <p class="MsoNormal">Thanks,<u></u><u></u></p>
            </div>
            <div>
              <p class="MsoNormal">Alina<u></u><u></u></p>
            </div>
          </div>
          <p class="MsoNormal"><u></u> <u></u></p>
          <div>
            <div>
              <p class="MsoNormal">On Thu, Sep 13, 2018 at 1:43 PM
                Philip Reames <<a href="mailto:listmail@philipreames.com" target="_blank">listmail@philipreames.com</a>>
                wrote:<u></u><u></u></p>
            </div>
            <blockquote style="border:none;border-left:solid #cccccc 1.0pt;padding:0cm 0cm 0cm 6.0pt;margin-left:4.8pt;margin-right:0cm">
              <div>
                <p class="MsoNormal">(minor inline additions)<br>
                  <br>
                  On 09/13/2018 01:51 AM, Chandler Carruth wrote:<br>
                  <br>
                  <u></u><u></u></p>
                <div>
                  <p class="MsoNormal">Haven't had time to dig into
                    this, but wanted to add <a href="mailto:asbirlea@google.com" id="m_3669519156898967894m_-2310710337215009520IloFPc-1" target="_blank">+Alina Sbirlea</a> to the
                    thread as she has been working on promotion and
                    other aspects of LICM for a long time here.<u></u><u></u></p>
                </div>
                <p class="MsoNormal">Thanks!<br>
                  <br>
                  <u></u><u></u></p>
                <div>
                  <div>
                    <p class="MsoNormal">On Wed, Sep 12, 2018 at 11:41
                      PM Philip Reames <<a href="mailto:listmail@philipreames.com" target="_blank">listmail@philipreames.com</a>>
                      wrote:<u></u><u></u></p>
                  </div>
                  <blockquote style="border:none;border-left:solid #cccccc 1.0pt;padding:0cm 0cm 0cm 6.0pt;margin-left:4.8pt;margin-right:0cm">
                    <div>
                      <p>I'm thinking about making some semi radical
                        changes to load store promotion works in LICM,
                        and figured it would be a good idea to get buy
                        in before I actually started writing code.  :)<u></u><u></u></p>
                      <p>TLDR: legality of sinking stores to exits is
                        hard, can we separate load handling into a super
                        aggressive form of PRE, and use predicated
                        stores to avoid solving legality question?<u></u><u></u></p>
                      <p><u></u> <u></u></p>
                      <p>Background<u></u><u></u></p>
                      <p>We've been seeing an interesting class of
                        problems recently that looks roughly like this:<u></u><u></u></p>
                      <p>for (int = 0; i < N; i++)<br>
                          if (a[i] == 0) // some data dependent check<br>
                            g_count++; // conditional load and store to
                        shared location<u></u><u></u></p>
                      <p>The critical detail in this example is that
                        g_count is a global location which may be
                        accessed concurrently* by multiple threads.  The
                        challenge here is that we don't know whether the
                        store ever executes, and we're not allowed to
                        insert a store along any path that didn't
                        originally contain them.  Said differently, if
                        all elements in "a" are non-zero, we're not
                        allowed to store to g_count.  We do know that
                        the g_count location is dereferenceable though. 
                        <u></u><u></u></p>
                      <p>(*Please, let's avoid the memory model
                        derailment here.  I'm simplifying and C++
                        language rules aren't real useful for my Java
                        language frontend anyways.  In practice, all the
                        access are atomic, but unordered, but we'll
                        leave that out of discussion otherwise.)<u></u><u></u></p>
                      <p>I have two approaches I'm considering here. 
                        These are orthogonal, but I suspect we'll want
                        to implement both.<u></u><u></u></p>
                      <p><u></u> <u></u></p>
                      <p>Proposal A - Use predicated stores in loop
                        exits<u></u><u></u></p>
                      <p>The basic idea is that we don't worry about
                        solving the legality question above, and just
                        insert a store which is predicated on a
                        condition which is true exactly when the
                        original store ran.  In pseudo code, this looks
                        something like:<u></u><u></u></p>
                      <p>bool StoreLegal = false;<br>
                        int LocalCount = g_count;<br>
                        for (int = 0; i < N; i++)<br>
                          if (a[i] == 0) {<br>
                            LocalCount++;<br>
                            StoreLegal = true;<br>
                          }<br>
                        if (StoreLegal) g_count = LocalCount; <u></u><u></u></p>
                      <p>There are two obvious concerns here:<u></u><u></u></p>
                      <ol start="1" type="1">
                        <li class="MsoNormal">
                          The predicated store might be expensive in
                          practice - true for most current
                          architectures.<u></u><u></u></li>
                        <li class="MsoNormal">
                          We''re introducing an extra boolean phi cycle
                          around the loop.  <u></u><u></u></li>
                      </ol>
                      <p>Talking to a couple of folks offline at the
                        socials over the last few months, the former
                        seems to be the main objection.  I think we can
                        control this by restricting this transform to
                        when the loop is believed to have a high trip
                        count and the conditional path is taken with
                        some frequency.   Getting feedback on this
                        particular point is one of my main reasons for
                        writing this mail. 
                        <u></u><u></u></p>
                      <p>The second objection can frequently be resolved
                        by finding other expressions which evaluate to
                        the same boolean.  (In this case, if LocalCount
                        != LocalCountOrig assuming i doesn't overflow.) 
                        We already have a framework with SCEV to do
                        these replacements.  Though, from some quick
                        testing, it would definitely need
                        strengthening.  However, SCEV can't remove the
                        extra phi in all cases, so we have to be okay
                        with the extra phi cycle in the general case. 
                        This seems worthwhile to me, but thoughts?<u></u><u></u></p>
                      <p><u></u> <u></u></p>
                      <p>Proposal B - Separate load and store handling
                        into distinct transforms<u></u><u></u></p>
                      <p>(For folks I've discussed this with before,
                        this part is all new.)<u></u><u></u></p>
                      <p>Thinking about the problem, it finally occurred
                        to me that we can decompose the original example
                        into two steps: getting the loads out of the
                        loop, and sinking the stores out of the loop. 
                        If we can accomplish the former, but not the
                        later, we've still made meaningful progress.  <u></u><u></u></p>
                      <p>So, what'd we'd essentially have is a load only
                        transformation which produces this:<br>
                        int LocalCount = g_count;<br>
                        for (int = 0; i < N; i++)<br>
                          if (a[i] == 0) {<br>
                            LocalCount++;<br>
                            g_count = LocalCount;<br>
                          }<u></u><u></u></p>
                      <p>At this point, we've reduced memory traffic by
                        half, and opened up the possibility that other
                        parts of the optimizer can exploit the
                        simplified form.  The last point is particular
                        interesting since we generally try to
                        canonicalize loads out of loops, and much of the
                        optimizer is tuned for a form with as much as
                        possible being loop invariant.  As one example,
                        completely by accident, there's some work going
                        on in the LoopVectorizer right now to handle
                        such stores to loop invariant addresses during
                        vectorization.  Putting the two pieces together
                        would let us fully vectorize this loop without
                        needing to sink the stores at all.  
                        <u></u><u></u></p>
                      <p>In practice, the existing implementation in
                        LICM would cleanly split along these lines with
                        little problem. 
                        <u></u><u></u></p>
                      <p>One way of looking at this load specific
                        transform is as an extreme form of PRE (partial
                        redundancy elimination).  Our current PRE
                        implementation doesn't handle cases this
                        complicated.
                        <u></u><u></u></p>
                    </div>
                  </blockquote>
                </div>
                <p class="MsoNormal">It occurred to my later that simply
                  framing the new transform as a separate pass (LoopPRE)
                  and using the same AST + SSA construction approach
                  would be straight forward.  So, if folks think that
                  having an aggressive form of load PRE in LICM is going
                  a bit too far, it'd be easy to represent as an
                  optional separate pass.  I'd still prefer having LICM
                  contain the logic though. 
                  <br>
                  <br>
                  <u></u><u></u></p>
                <div>
                  <blockquote style="border:none;border-left:solid #cccccc 1.0pt;padding:0cm 0cm 0cm 6.0pt;margin-left:4.8pt;margin-right:0cm">
                    <div>
                      <p>Thoughts?<u></u><u></u></p>
                    </div>
                    <div>
                      <p>Philip<u></u><u></u></p>
                    </div>
                  </blockquote>
                </div>
                <p class="MsoNormal"><u></u> <u></u></p>
              </div>
            </blockquote>
          </div>
        </div>
      </div>
    </blockquote>
    <br>
  </div>

_______________________________________________<br>
LLVM Developers mailing list<br>
<a href="mailto:llvm-dev@lists.llvm.org" target="_blank">llvm-dev@lists.llvm.org</a><br>
<a href="http://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev" rel="noreferrer" target="_blank">http://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev</a><br>
</blockquote></div></div>