<html>

  <head>

    <meta content="text/html; charset=ISO-8859-1"

      http-equiv="Content-Type">

  </head>

  <body text="#000000" bgcolor="#FFFFFF">

    <div class="moz-cite-prefix">On 05/14/2014 06:02 PM, Akira Hatanaka

      wrote:<br>

    </div>

    <blockquote

cite="mid:CAB297QxdgMvPOuuD9SKwbcdVxQo6iKDpk+TzmqruXUGiEtWJZg@mail.gmail.com"

      type="cite">

      <div dir="ltr">

        <div>I would like to get feedback from the community on how I

          can speed up the compilation of a function that has one huge

          basic block consisting of over 150K instructions. It takes

          about 10 minutes for clang to produce the object file and the

          function that is taking up the majority of the time is

          LoadAndStorePromoter::run, which is called when SROA is run:</div>

        <div><br>

        </div>

        <div>

          <p style="margin:0px;font-size:11px;font-family:Menlo">    //

            Otherwise, we have mixed loads and stores (or just a bunch

            of stores).</p>

          <p style="margin:0px;font-size:11px;font-family:Menlo">    //

            Since SSAUpdater is purely for cross-block values, we need

            to determine</p>

          <p style="margin:0px;font-size:11px;font-family:Menlo">    //

            the order of these instructions in the block.  If the first

            use in the</p>

          <p style="margin:0px;font-size:11px;font-family:Menlo">    //

            block is a load, then it uses the live in value.  The last

            store defines</p>

          <p style="margin:0px;font-size:11px;font-family:Menlo">    //

            the live out value.  We handle this by doing a linear scan

            of the block.</p>

          <p style="margin:0px;font-size:11px;font-family:Menlo">   

            Value *StoredValue = nullptr;</p>

          <p style="margin:0px;font-size:11px;font-family:Menlo">    for

            (BasicBlock::iterator II = BB->begin(), E = BB->end();

            II != E; ++II) {</p>

          <p style="margin:0px;font-size:11px;font-family:Menlo"><br>

          </p>

          <p style="margin:0px">

            If I understand this code

            correctly. LoadAndStorePromoter::run is called once per

            every promotable alloca and iterates over the whole list to

            determine the order of loads and stores in the basic block

            that access the alloca.</p>

        </div>

      </div>

    </blockquote>

    I'm a bit confused about the cost here.  Based on the code, I see us

    walking each basic block at most once per alloca.  Even at 150K

    instructions, I wouldn't expect the loop to take *that* long.  How

    many allocas are you dealing with?<br>

    <br>

    I do see an outer loop in SROA::performPromotion.  Are you possibly

    triggering a large number of iterations there?  That would make the

    inner loop appear much more expensive.  <br>

    <br>

    <blockquote

cite="mid:CAB297QxdgMvPOuuD9SKwbcdVxQo6iKDpk+TzmqruXUGiEtWJZg@mail.gmail.com"

      type="cite">

      <div dir="ltr">

        <div>

          <p style="margin:0px"><br>

          </p>

          <p style="margin:0px">This is the list of ideas I have

            considered or implemented that can possibly solve my

            problem:</p>

          <p style="margin:0px"><br>

          </p>

          <p style="margin:0px">1. In SROA::getAnalysisUsage, always

            require DominatorTreeWrapperPass. This will enable

            SROA::promoteAllocas to use mem2reg, which is fast because

            it caches the per basic-block ordering of the relevant loads

            and stores. If it's important to avoid always computing the

            dominator tree, computing it conditionally based on whether

            there is a huge basic block in the function is another idea,

            but I am not sure if that is possible (I don't think this is

            currently supported).</p>

          <p style="margin:0px"><br>

          </p>

          <p style="margin:0px">This brings down the compilation time

            (using clang -emit-llvm) from 350s to 30s (it still takes

            about 23s to do GVN). It also might fix PR17855 (the program

            that used to take 65s to compile now takes just 11s):</p>

          <p style="margin:0px"><br>

          </p>

          <p style="margin:0px"><a moz-do-not-send="true"

              href="http://llvm.org/bugs/show_bug.cgi?id=17855"

              target="_blank">http://llvm.org/bugs/show_bug.cgi?id=17855</a><br>

          </p>

        </div>

      </div>

    </blockquote>

    Chandler commented here, I'll defer to him.  <br>

    <blockquote

cite="mid:CAB297QxdgMvPOuuD9SKwbcdVxQo6iKDpk+TzmqruXUGiEtWJZg@mail.gmail.com"

      type="cite">

      <div dir="ltr">

        <div>

          <p style="margin:0px"><br>

          </p>

          <p style="margin:0px">

            2. Cache the ordering of loads and stores in

            LoadAndStorePromoter::run. I don't know how hard it would be

            to implement this, but I think it would be as fast as 1

            (using mem2reg).</p>

        </div>

      </div>

    </blockquote>

    Caching the first load and first store for each alloca in a single

    pass doesn't seem too bad.  (Assuming that that doesn't change

    across iterations at least - I'm not particularly familiar with this

    code.)  This seems like a fairly isolated improvement.  Assuming

    Chandler doesn't disagree, I'd say this seems like a reasonable

    first step.  <br>

    <blockquote

cite="mid:CAB297QxdgMvPOuuD9SKwbcdVxQo6iKDpk+TzmqruXUGiEtWJZg@mail.gmail.com"

      type="cite">

      <div dir="ltr">

        <div>

          <p style="margin:0px"><br>

          </p>

          <p style="margin:0px">

            3. Insert a pass that splits huge basic blocks into smaller

            blocks prior to SROA and recombine them after the pass is

            run.</p>

        </div>

      </div>

    </blockquote>

    How would this actually help?  If you have uses in each of the split

    blocks, wouldn't you spend the exact same time?  <br>

    <blockquote

cite="mid:CAB297QxdgMvPOuuD9SKwbcdVxQo6iKDpk+TzmqruXUGiEtWJZg@mail.gmail.com"

      type="cite">

      <div dir="ltr">

        <div>

          <p style="margin:0px"><br>

          </p>

          <p style="margin:0px">4. Change the definition of class

            Instruction in IR/Instruction.h so that determining whether

            one instruction precedes another becomes more efficient. My

            current implementation brings down the compilation time to

            just 6s.</p>

          <p style="margin:0px"><br>

          </p>

          <p style="margin:0px">I won't go into much detail about the

            change I made, but it basically replaces Instruction's

            pointer to its parent BasicBlock with another data

            structure. I haven't done much investigation into how this

            will affect the common case.</p>

          <p style="margin:0px"><br>

          </p>

          <div>This will also help speed up the execution of code in

            other places. For example, these two functions are called

            when GVN::processLoad is called and iterate over a basic

            block's instruction list to determine the order of two (load

            and stores) instructions.:</div>

          <p style="margin:0px"><br>

          </p>

          <p style="margin:0px;font-size:11px;font-family:Menlo"><span

              style="font-family:arial;font-size:small">llvm::isPotentiallyReachable</span></p>

          <p style="margin:0px;font-size:11px;font-family:Menlo">

            <br>

          </p>

          <p style="margin:0px;font-size:11px;font-family:Menlo">    //

            Linear scan, start at 'A', see whether we hit 'B' or the end

            first.</p>

          <p style="margin:0px;font-size:11px;font-family:Menlo">    for

            (BasicBlock::const_iterator I = A, E = BB->end(); I != E;

            ++I) {</p>

          <p style="margin:0px;font-size:11px;font-family:Menlo">   

              if (&*I == B)</p>

          <p style="margin:0px;font-size:11px;font-family:Menlo">     

              return true;</p>

          <p style="margin:0px;font-size:11px;font-family:Menlo">

                }</p>

          <div><br>

          </div>

          <div>DominatorTree::dominates</div>

          <div><br>

          </div>

          <div>

            <p style="margin:0px;font-size:11px;font-family:Menlo">  //

              Loop through the basic block until we find Def or User.</p>

            <p style="margin:0px;font-size:11px;font-family:Menlo">

                BasicBlock::const_iterator I = DefBB->begin();</p>

            <p style="margin:0px;font-size:11px;font-family:Menlo">  for

              (; &*I != Def && &*I != User; ++I)</p>

            <p style="margin:0px;font-size:11px;font-family:Menlo">

                  /*empty*/;</p>

          </div>

        </div>

      </div>

    </blockquote>

    I'd be curious to see a proposal here.  I've been bitten by the

    DT::dominates issue myself.  <br>

    <blockquote

cite="mid:CAB297QxdgMvPOuuD9SKwbcdVxQo6iKDpk+TzmqruXUGiEtWJZg@mail.gmail.com"

      type="cite">

      <div dir="ltr">

        <div>

          <div><br>

          </div>

          <p style="margin:0px">5. Simply give up doing SROA (and

            GVN::processLoad) if there is a huge basic block that can

            potentially take a long time to compile.</p>

        </div>

      </div>

    </blockquote>

    Until we have something better, this might be a necessary hack. 

    Whether it's actually worth introducing depends on how long Chandler

    thinks the mem2reg version will take to land.  <br>

    <br>

    Philip<br>

    <br>

  </body>

</html>