[LLVMdev] SROA is slow when compiling a large basic block

Philip Reames listmail at philipreames.com
Wed May 21 15:05:47 PDT 2014


On 05/15/2014 11:12 AM, Akira Hatanaka wrote:
> On Thu, May 15, 2014 at 9:31 AM, Philip Reames 
> <listmail at philipreames.com <mailto:listmail at philipreames.com>> wrote:
>
>     On 05/14/2014 06:02 PM, Akira Hatanaka wrote:
>>     I would like to get feedback from the community on how I can
>>     speed up the compilation of a function that has one huge basic
>>     block consisting of over 150K instructions. It takes about 10
>>     minutes for clang to produce the object file and the function
>>     that is taking up the majority of the time is
>>     LoadAndStorePromoter::run, which is called when SROA is run:
>>
>>       // Otherwise, we have mixed loads and stores (or just a bunch of stores).
>>       // Since SSAUpdater is purely for cross-block values, we need to determine
>>       // the order of these instructions in the block.  If the first use in the
>>       // block is a load, then it uses the live in value.  The last store defines
>>       // the live out value.  We handle this by doing a linear scan of the block.
>>       Value *StoredValue = nullptr;
>>       for (BasicBlock::iterator II = BB->begin(), E = BB->end(); II != E; ++II) {
>>
>>
>>     If I understand this code correctly, LoadAndStorePromoter::run is
>>     called once for every promotable alloca and iterates over the
>>     whole instruction list to determine the order of the loads and
>>     stores in the basic block that access the alloca.
>>
>     I'm a bit confused about the cost here.  Based on the code, I see
>     us walking each basic block at most once per alloca.  Even at 150K
>     instructions, I wouldn't expect the loop to take *that* long.  How
>     many allocas are you dealing with?
>
>     I do see an outer loop in SROA::performPromotion.  Are you
>     possibly triggering a large number of iterations there?  That
>     would make the inner loop appear much more expensive.
>
>
> The unoptimized bitcode (compiled with clang -O0) has a large number 
> of allocas (over 7000), which I think is also contributing to the long 
> compilation time.
I suspect that many things in the optimizer aren't going to react well 
to that many allocas.  :)  Where are they all coming from?  And why 
hasn't mem2reg removed most of them?  I suspect you'd be better off 
investing effort to address this issue than in optimizing SROA.
>
> But making LoadAndStorePromoter::run faster is sufficient to reduce 
> the compilation time (at least in my case).
I'm not objecting to this.  I'd be happy to see something along these 
lines land, I'm just not sure it solves the root cause of your issues.
>  ...
>
>>     4. Change the definition of class Instruction in IR/Instruction.h
>>     so that determining whether one instruction precedes another
>>     becomes more efficient. My current implementation brings down the
>>     compilation time to just 6s.
>>
>>
>>     I won't go into much detail about the change I made, but it
>>     basically replaces Instruction's pointer to its parent BasicBlock
>>     with another data structure. I haven't done much investigation
>>     into how this will affect the common case.
>>
>>
>>     This will also help speed up the execution of code in other
>>     places. For example, these two functions are called from
>>     GVN::processLoad and iterate over a basic block's instruction
>>     list to determine the relative order of two instructions (a load
>>     and a store):
>>
>>
>>     llvm::isPotentiallyReachable
>>
>>
>>       // Linear scan, start at 'A', see whether we hit 'B' or the end first.
>>       for (BasicBlock::const_iterator I = A, E = BB->end(); I != E; ++I) {
>>         if (&*I == B)
>>           return true;
>>       }
>>
>>
>>     DominatorTree::dominates
>>
>>       // Loop through the basic block until we find Def or User.
>>       BasicBlock::const_iterator I = DefBB->begin();
>>       for (; &*I != Def && &*I != User; ++I)
>>         /*empty*/;
>>
>     I'd be curious to see a proposal here.  I've been bitten by the
>     DT::dominates issue myself.
>>
> I was hesitant to make changes in classes such as BasicBlock and 
> Instruction just to make it faster to compile a basic block with 150K+ 
> instructions (which I think is a pathological case). If there are 
> other use cases that need to determine the order of instructions in a 
> basic block, it might be worth considering a fix in those classes.
> The attached patch is my current implementation. Basically, I am 
> trying to make Instruction::precede fast and use it to speed up 
> functions that currently iterate over the whole instruction list to 
> determine the order of instructions in a basic block.
I get where you're going with this, but I don't think this is the right 
approach.  In particular, you're introducing extra indirections along 
some fairly common paths and the caching scheme around sublist is 
non-obvious.

I do like the extraction of the 'precedes' interface.  It gives us a 
single place to optimize.

Looking at the problem, I don't see an easy answer here.  If we weren't 
using linked lists, this would be easy, but with that constraint, 
everything I come up with seems nasty.

One somewhat less nasty (but still bad) scheme I came up with would be 
to use the traits of the InstList in BasicBlock to create a listener 
which ignores all updates except when order is interesting (i.e. inside 
the SROA code).  This seems really bug-prone though.  I am not in favour 
of actually implementing such a thing.

Another scheme might be to have an Analysis which caches the answer 
"does X precede Y".  By default, nothing would preserve this, and we 
could teach certain passes to be aware of it.  Again, this feels dirty.

I don't really have a good solution here.

>
>>     5. Simply give up doing SROA (and GVN::processLoad) if there is a
>>     huge basic block that can potentially take a long time to compile.
>>
>     Until we have something better, this might be a necessary hack. 
>     Whether it's actually worth introducing depends on how long
>     Chandler thinks the mem2reg version will take to land.
>
>
> I tried this idea today, and unfortunately it didn't reduce the 
> compilation time. Skipping SROA leaves a lot of loads/stores in the 
> code that could have been promoted, and that places the burden on 
> other passes that are run later, such as GVN and DSE.
Makes some sense.  See my point above about lots of allocas stressing 
the optimizer.  :)

p.s. Any chance of bugs being filed with test cases for each of those?  :)

Philip