[LLVMdev] SROA is slow when compiling a large basic block
Philip Reames
listmail at philipreames.com
Thu May 15 09:31:40 PDT 2014
On 05/14/2014 06:02 PM, Akira Hatanaka wrote:
> I would like to get feedback from the community on how I can speed up
> the compilation of a function that has one huge basic block consisting
> of over 150K instructions. It takes about 10 minutes for clang to
> produce the object file and the function that is taking up the
> majority of the time is LoadAndStorePromoter::run, which is called
> when SROA is run:
>
> // Otherwise, we have mixed loads and stores (or just a bunch of
> stores).
>
> // Since SSAUpdater is purely for cross-block values, we need to
> determine
>
> // the order of these instructions in the block. If the first use
> in the
>
> // block is a load, then it uses the live in value. The last
> store defines
>
> // the live out value. We handle this by doing a linear scan of
> the block.
>
> Value *StoredValue = nullptr;
>
> for (BasicBlock::iterator II = BB->begin(), E = BB->end(); II !=
> E; ++II) {
>
>
> If I understand this code correctly. LoadAndStorePromoter::run is
> called once per every promotable alloca and iterates over the whole
> list to determine the order of loads and stores in the basic block
> that access the alloca.
>
I'm a bit confused about the cost here. Based on the code, I see us
walking each basic block at most once per alloca. Even at 150K
instructions, I wouldn't expect the loop to take *that* long. How many
allocas are you dealing with?
I do see an outer loop in SROA::performPromotion. Are you possibly
triggering a large number of iterations there? That would make the
inner loop appear much more expensive.
>
> This is the list of ideas I have considered or implemented that can
> possibly solve my problem:
>
>
> 1. In SROA::getAnalysisUsage, always require DominatorTreeWrapperPass.
> This will enable SROA::promoteAllocas to use mem2reg, which is fast
> because it caches the per basic-block ordering of the relevant loads
> and stores. If it's important to avoid always computing the dominator
> tree, computing it conditionally based on whether there is a huge
> basic block in the function is another idea, but I am not sure if that
> is possible (I don't think this is currently supported).
>
>
> This brings down the compilation time (using clang -emit-llvm) from
> 350s to 30s (it still takes about 23s to do GVN). It also might fix
> PR17855 (the program that used to take 65s to compile now takes just 11s):
>
>
> http://llvm.org/bugs/show_bug.cgi?id=17855
>
Chandler commented here, I'll defer to him.
>
>
> 2. Cache the ordering of loads and stores in
> LoadAndStorePromoter::run. I don't know how hard it would be to
> implement this, but I think it would be as fast as 1 (using mem2reg).
>
Caching the first load and first store for each alloca in a single pass
doesn't seem too bad. (Assuming that that doesn't change across
iterations at least - I'm not particularly familiar with this code.)
This seems like a fairly isolated improvement. Assuming Chandler
doesn't disagree, I'd say this seems like a reasonable first step.
>
>
> 3. Insert a pass that splits huge basic blocks into smaller blocks
> prior to SROA and recombine them after the pass is run.
>
How would this actually help? If you have uses in each of the split
blocks, wouldn't you spend the exact same time?
>
>
> 4. Change the definition of class Instruction in IR/Instruction.h so
> that determining whether one instruction precedes another becomes more
> efficient. My current implementation brings down the compilation time
> to just 6s.
>
>
> I won't go into much detail about the change I made, but it basically
> replaces Instruction's pointer to its parent BasicBlock with another
> data structure. I haven't done much investigation into how this will
> affect the common case.
>
>
> This will also help speed up the execution of code in other places.
> For example, these two functions are called when GVN::processLoad is
> called and iterate over a basic block's instruction list to determine
> the order of two (load and stores) instructions.:
>
>
> llvm::isPotentiallyReachable
>
>
> // Linear scan, start at 'A', see whether we hit 'B' or the end first.
>
> for (BasicBlock::const_iterator I = A, E = BB->end(); I != E; ++I) {
>
> if (&*I == B)
>
> return true;
>
> }
>
>
> DominatorTree::dominates
>
> // Loop through the basic block until we find Def or User.
>
> BasicBlock::const_iterator I = DefBB->begin();
>
> for (; &*I != Def && &*I != User; ++I)
>
> /*empty*/;
>
I'd be curious to see a proposal here. I've been bitten by the
DT::dominates issue myself.
>
> 5. Simply give up doing SROA (and GVN::processLoad) if there is a huge
> basic block that can potentially take a long time to compile.
>
Until we have something better, this might be a necessary hack. Whether
it's actually worth introducing depends on how long Chandler thinks the
mem2reg version will take to land.
Philip
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.llvm.org/pipermail/llvm-dev/attachments/20140515/99b3fcb4/attachment.html>
More information about the llvm-dev
mailing list