[LLVMdev] SROA is slow when compiling a large basic block

Thu May 15 09:31:40 PDT 2014

On 05/14/2014 06:02 PM, Akira Hatanaka wrote:
> I would like to get feedback from the community on how I can speed up 
> the compilation of a function that has one huge basic block consisting 
> of over 150K instructions. It takes about 10 minutes for clang to 
> produce the object file and the function that is taking up the 
> majority of the time is LoadAndStorePromoter::run, which is called 
> when SROA is run:
>
>     // Otherwise, we have mixed loads and stores (or just a bunch of 
> stores).
>
>     // Since SSAUpdater is purely for cross-block values, we need to 
> determine
>
>     // the order of these instructions in the block.  If the first use 
> in the
>
>     // block is a load, then it uses the live in value.  The last 
> store defines
>
>     // the live out value.  We handle this by doing a linear scan of 
> the block.
>
> Value *StoredValue = nullptr;
>
>     for (BasicBlock::iterator II = BB->begin(), E = BB->end(); II != 
> E; ++II) {
>
>
> If I understand this code correctly. LoadAndStorePromoter::run is 
> called once per every promotable alloca and iterates over the whole 
> list to determine the order of loads and stores in the basic block 
> that access the alloca.
>
I'm a bit confused about the cost here.  Based on the code, I see us 
walking each basic block at most once per alloca.  Even at 150K 
instructions, I wouldn't expect the loop to take *that* long.  How many 
allocas are you dealing with?

I do see an outer loop in SROA::performPromotion.  Are you possibly 
triggering a large number of iterations there?  That would make the 
inner loop appear much more expensive.

>
> This is the list of ideas I have considered or implemented that can 
> possibly solve my problem:
>
>
> 1. In SROA::getAnalysisUsage, always require DominatorTreeWrapperPass. 
> This will enable SROA::promoteAllocas to use mem2reg, which is fast 
> because it caches the per basic-block ordering of the relevant loads 
> and stores. If it's important to avoid always computing the dominator 
> tree, computing it conditionally based on whether there is a huge 
> basic block in the function is another idea, but I am not sure if that 
> is possible (I don't think this is currently supported).
>
>
> This brings down the compilation time (using clang -emit-llvm) from 
> 350s to 30s (it still takes about 23s to do GVN). It also might fix 
> PR17855 (the program that used to take 65s to compile now takes just 11s):
>
>
> http://llvm.org/bugs/show_bug.cgi?id=17855
>
Chandler commented here, I'll defer to him.
>
>
> 2. Cache the ordering of loads and stores in 
> LoadAndStorePromoter::run. I don't know how hard it would be to 
> implement this, but I think it would be as fast as 1 (using mem2reg).
>
Caching the first load and first store for each alloca in a single pass 
doesn't seem too bad.  (Assuming that that doesn't change across 
iterations at least - I'm not particularly familiar with this code.)  
This seems like a fairly isolated improvement.  Assuming Chandler 
doesn't disagree, I'd say this seems like a reasonable first step.
>
>
> 3. Insert a pass that splits huge basic blocks into smaller blocks 
> prior to SROA and recombine them after the pass is run.
>
How would this actually help?  If you have uses in each of the split 
blocks, wouldn't you spend the exact same time?
>
>
> 4. Change the definition of class Instruction in IR/Instruction.h so 
> that determining whether one instruction precedes another becomes more 
> efficient. My current implementation brings down the compilation time 
> to just 6s.
>
>
> I won't go into much detail about the change I made, but it basically 
> replaces Instruction's pointer to its parent BasicBlock with another 
> data structure. I haven't done much investigation into how this will 
> affect the common case.
>
>
> This will also help speed up the execution of code in other places. 
> For example, these two functions are called when GVN::processLoad is 
> called and iterate over a basic block's instruction list to determine 
> the order of two (load and stores) instructions.:
>
>
> llvm::isPotentiallyReachable
>
>
>     // Linear scan, start at 'A', see whether we hit 'B' or the end first.
>
>     for (BasicBlock::const_iterator I = A, E = BB->end(); I != E; ++I) {
>
>   if (&*I == B)
>
>   return true;
>
>     }
>
>
> DominatorTree::dominates
>
>   // Loop through the basic block until we find Def or User.
>
>   BasicBlock::const_iterator I = DefBB->begin();
>
>   for (; &*I != Def && &*I != User; ++I)
>
>     /*empty*/;
>
I'd be curious to see a proposal here.  I've been bitten by the 
DT::dominates issue myself.
>
> 5. Simply give up doing SROA (and GVN::processLoad) if there is a huge 
> basic block that can potentially take a long time to compile.
>
Until we have something better, this might be a necessary hack. Whether 
it's actually worth introducing depends on how long Chandler thinks the 
mem2reg version will take to land.

Philip

-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.llvm.org/pipermail/llvm-dev/attachments/20140515/99b3fcb4/attachment.html>