[LLVMdev] SROA is slow when compiling a large basic block

Akira Hatanaka ahatanak at gmail.com
Thu May 15 11:12:48 PDT 2014


On Thu, May 15, 2014 at 9:31 AM, Philip Reames <listmail at philipreames.com> wrote:

>  On 05/14/2014 06:02 PM, Akira Hatanaka wrote:
>
>  I would like to get feedback from the community on how I can speed up
> the compilation of a function that has one huge basic block consisting of
> over 150K instructions. It takes about 10 minutes for clang to produce the
> object file and the function that is taking up the majority of the time is
> LoadAndStorePromoter::run, which is called when SROA is run:
>
>     // Otherwise, we have mixed loads and stores (or just a bunch of stores).
>     // Since SSAUpdater is purely for cross-block values, we need to determine
>     // the order of these instructions in the block.  If the first use in the
>     // block is a load, then it uses the live in value.  The last store defines
>     // the live out value.  We handle this by doing a linear scan of the block.
>     Value *StoredValue = nullptr;
>     for (BasicBlock::iterator II = BB->begin(), E = BB->end(); II != E; ++II) {
>
>
>  If I understand this code correctly, LoadAndStorePromoter::run is called
> once per promotable alloca and iterates over the whole instruction list to
> determine the order of the loads and stores in the basic block that access
> the alloca.
>
> I'm a bit confused about the cost here.  Based on the code, I see us
> walking each basic block at most once per alloca.  Even at 150K
> instructions, I wouldn't expect the loop to take *that* long.  How many
> allocas are you dealing with?
>
> I do see an outer loop in SROA::performPromotion.  Are you possibly
> triggering a large number of iterations there?  That would make the inner
> loop appear much more expensive.
>
>
The unoptimized bitcode (compiled with clang -O0) has a large number of
allocas (over 7000), which I think is also contributing to the long
compilation time.

But making LoadAndStorePromoter::run faster is sufficient to reduce the
compilation time (at least in my case).
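To put numbers on the cost: a toy model (plain C++ for illustration, not LLVM code) that counts instruction visits shows why 7,000 allocas over a 150K-instruction block is expensive. The per-alloca rescans do 7,000 x 150,000, roughly 1.05 billion visits, versus 150,000 for a single cached pass:

```cpp
#include <cassert>

// Toy cost model (illustrative only, not LLVM code): if every promotable
// alloca triggers its own linear scan of an N-instruction block, the total
// number of instruction visits is A * N; a single cached ordering pass
// visits each instruction once.
long naiveVisits(long NumAllocas, long BlockSize) {
  long Visited = 0;
  for (long A = 0; A < NumAllocas; ++A)   // one scan per alloca...
    for (long I = 0; I < BlockSize; ++I)  // ...over the whole block
      ++Visited;
  return Visited;
}

long cachedVisits(long BlockSize) {
  return BlockSize;  // one pass builds the ordering for every alloca
}
```

With the numbers from this thread, naiveVisits(7000, 150000) is about four orders of magnitude more work than cachedVisits(150000).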


>
>  This is the list of ideas I have considered or implemented that can
> possibly solve my problem:
>
>
>  1. In SROA::getAnalysisUsage, always require DominatorTreeWrapperPass.
> This will enable SROA::promoteAllocas to use mem2reg, which is fast because
> it caches the per basic-block ordering of the relevant loads and stores. If
> it's important to avoid always computing the dominator tree, computing it
> conditionally based on whether there is a huge basic block in the function
> is another idea, but I am not sure if that is possible (I don't think this
> is currently supported).
>
>
>  This brings down the compilation time (using clang -emit-llvm) from 350s
> to 30s (it still takes about 23s to do GVN). It also might fix PR17855 (the
> program that used to take 65s to compile now takes just 11s):
>
>
>  http://llvm.org/bugs/show_bug.cgi?id=17855
>
> Chandler commented here, I'll defer to him.
>
>
>  2. Cache the ordering of loads and stores in LoadAndStorePromoter::run.
> I don't know how hard it would be to implement this, but I think it would
> be as fast as 1 (using mem2reg).
>
> Caching the first load and first store for each alloca in a single pass
> doesn't seem too bad.  (Assuming that that doesn't change across iterations
> at least - I'm not particularly familiar with this code.)  This seems like
> a fairly isolated improvement.  Assuming Chandler doesn't disagree, I'd say
> this seems like a reasonable first step.
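The single-pass caching idea could be sketched as below. This is a hedged sketch with illustrative names (Access, AllocaId, FirstLast are invented here, not LLVM types): one scan over the block records, for each alloca, the index of its first load and its last store, after which no rescans are needed.

```cpp
#include <cassert>
#include <map>
#include <vector>

// Illustrative model of a block as a sequence of loads/stores tagged with
// the alloca they access.
enum class Kind { Load, Store };
struct Access { int AllocaId; Kind K; };

struct FirstLast {
  int FirstLoad = -1;  // index of the first load, -1 if none
  int LastStore = -1;  // index of the last store, -1 if none
};

// One linear pass over the block caches, per alloca, the first load
// (which uses the live-in value) and the last store (which defines the
// live-out value).
std::map<int, FirstLast> scanOnce(const std::vector<Access> &Block) {
  std::map<int, FirstLast> Info;
  for (int Idx = 0; Idx < (int)Block.size(); ++Idx) {
    const Access &A = Block[Idx];
    FirstLast &FL = Info[A.AllocaId];
    if (A.K == Kind::Load) {
      if (FL.FirstLoad < 0)
        FL.FirstLoad = Idx;  // keep only the first load seen
    } else {
      FL.LastStore = Idx;    // keep overwriting; the last store wins
    }
  }
  return Info;
}
```

The caveat Philip raises applies: this only pays off if the cached ordering stays valid across the promoter's iterations.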
>
>
>  3. Insert a pass that splits huge basic blocks into smaller blocks prior
> to SROA and recombine them after the pass is run.
>
> How would this actually help?  If you have uses in each of the split
> blocks, wouldn't you spend the exact same time?
>
>
It might not help much. LoadAndStorePromoter::run avoids scanning all
instructions in a block when the block contains only one user of the alloca
or no stores, and splitting might increase the number of such blocks.

>  4. Change the definition of class Instruction in IR/Instruction.h so that
> determining whether one instruction precedes another becomes more
> efficient. My current implementation brings down the compilation time to
> just 6s.
>
>
>  I won't go into much detail about the change I made, but it basically
> replaces Instruction's pointer to its parent BasicBlock with another data
> structure. I haven't done much investigation into how this will affect the
> common case.
>
>
>  This will also help speed up code in other places. For example, these two
> functions are called from GVN::processLoad and iterate over a basic block's
> instruction list to determine the order of two instructions (loads and
> stores):
>
>
>  llvm::isPotentiallyReachable
>
>
>     // Linear scan, start at 'A', see whether we hit 'B' or the end first.
>     for (BasicBlock::const_iterator I = A, E = BB->end(); I != E; ++I) {
>       if (&*I == B)
>         return true;
>     }
>
>  DominatorTree::dominates
>
>    // Loop through the basic block until we find Def or User.
>    BasicBlock::const_iterator I = DefBB->begin();
>    for (; &*I != Def && &*I != User; ++I)
>      /*empty*/;
>
> I'd be curious to see a proposal here.  I've been bitten by the
> DT::dominates issue myself.
>
>
I was hesitant to make changes in classes such as BasicBlock and
Instruction just to make it faster to compile a basic block with 150K+
instructions (which I think is a pathological case). If there are other use
cases that need to determine the order of instructions in a basic block, it
might be worth fixing those classes.

The attached patch is my current implementation. Basically, I am trying to
make the function Instruction::precede fast and use it to speed up
functions that currently iterate over the whole instruction list to
determine the order of instructions in a basic block.
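One common way to make "does A precede B?" cheap is to cache a position number on each instruction, invalidate the cache when the block's list changes, and renumber lazily in one pass. This is a minimal sketch of that general technique, not necessarily what the attached patch does; the names below are invented for illustration:

```cpp
#include <cassert>
#include <vector>

// Each instruction carries a cached position within its block.
struct Inst {
  unsigned Order = 0;
};

struct Block {
  std::vector<Inst *> Insts;
  bool OrderValid = false;  // cleared whenever Insts is modified

  // One O(N) renumbering pass, amortized over many precedence queries.
  void renumber() {
    unsigned Idx = 0;
    for (Inst *I : Insts)
      I->Order = Idx++;
    OrderValid = true;
  }

  // O(1) after the lazy renumbering, instead of a linear scan per query.
  bool precedes(Inst *A, Inst *B) {
    if (!OrderValid)
      renumber();
    return A->Order < B->Order;
  }
};
```

Callers like isPotentiallyReachable and DominatorTree::dominates could then answer same-block queries with a comparison instead of walking the block.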

>  5. Simply give up doing SROA (and GVN::processLoad) if there is a huge
> basic block that can potentially take a long time to compile.
>
> Until we have something better, this might be a necessary hack.  Whether
> it's actually worth introducing depends on how long Chandler thinks the
> mem2reg version will take to land.
>
>
I tried this idea today, and unfortunately it didn't reduce the compilation
time. Skipping SROA leaves a lot of loads/stores in the code that could
have been promoted, and that places the burden on other passes that are run
later, such as GVN and DSE.


Philip
-------------- next part --------------
A non-text attachment was scrubbed...
Name: sublist1.patch
Type: application/octet-stream
Size: 24745 bytes
Desc: not available
URL: <http://lists.llvm.org/pipermail/llvm-dev/attachments/20140515/fc74f553/attachment.obj>

