<html><head><meta http-equiv="Content-Type" content="text/html charset=us-ascii"></head><body style="word-wrap: break-word; -webkit-nbsp-mode: space; -webkit-line-break: after-white-space;"><br><div><div>On May 21, 2014, at 3:05 PM, Philip Reames <<a href="mailto:listmail@philipreames.com">listmail@philipreames.com</a>> wrote:</div><br class="Apple-interchange-newline"><blockquote type="cite">
<meta content="text/html; charset=UTF-8" http-equiv="Content-Type">
<div bgcolor="#FFFFFF" text="#000000">
<br>
<div class="moz-cite-prefix">On 05/15/2014 11:12 AM, Akira Hatanaka
wrote:<br>
</div>
<blockquote cite="mid:CAB297QwaU1SB4ENicf6gTzV_Hkw5sJHPaG_wFHgTjL76rZoh8Q@mail.gmail.com" type="cite">
<div dir="ltr">On Thu, May 15, 2014 at 9:31 AM, Philip Reames <span dir="ltr"><<a moz-do-not-send="true" href="mailto:listmail@philipreames.com" target="_blank">listmail@philipreames.com</a>></span>
wrote:<br>
<div class="gmail_extra">
<div class="gmail_quote">
<blockquote class="gmail_quote" style="margin:0px 0px 0px
0.8ex;border-left-width:1px;border-left-color:rgb(204,204,204);border-left-style:solid;padding-left:1ex">
<div text="#000000" bgcolor="#FFFFFF">
<div class="">
<div>On 05/14/2014 06:02 PM, Akira Hatanaka wrote:<br>
</div>
<blockquote type="cite">
<div dir="ltr">
<div>I would like to get feedback from the
community on how I can speed up the compilation
of a function that has one huge basic block
consisting of over 150K instructions. It takes
about 10 minutes for clang to produce the object
file and the function that is taking up the
majority of the time is
LoadAndStorePromoter::run, which is called when
SROA is run:</div>
<div><br>
</div>
<div><div style="margin: 0px; font-size: 11px; font-family: Menlo;">
// Otherwise, we have mixed loads and stores
(or just a bunch of stores).</div><div style="margin: 0px; font-size: 11px; font-family: Menlo;">
// Since SSAUpdater is purely for
cross-block values, we need to determine</div><div style="margin: 0px; font-size: 11px; font-family: Menlo;">
// the order of these instructions in the
block. If the first use in the</div><div style="margin: 0px; font-size: 11px; font-family: Menlo;">
// block is a load, then it uses the live in
value. The last store defines</div><div style="margin: 0px; font-size: 11px; font-family: Menlo;">
// the live out value. We handle this by
doing a linear scan of the block.</div><div style="margin: 0px; font-size: 11px; font-family: Menlo;">
Value *StoredValue = nullptr;</div><div style="margin: 0px; font-size: 11px; font-family: Menlo;">
for (BasicBlock::iterator II =
BB->begin(), E = BB->end(); II != E;
++II) {</div><div style="margin: 0px; font-size: 11px; font-family: Menlo;"><br>
</div><div style="margin: 0px;"> If I understand this code
correctly. LoadAndStorePromoter::run is called
once per every promotable alloca and iterates
over the whole list to determine the order of
loads and stores in the basic block that
access the alloca.</div>
</div>
</div>
</blockquote>
</div>
I'm a bit confused about the cost here. Based on the
code, I see us walking each basic block at most once per
alloca. Even at 150K instructions, I wouldn't expect
the loop to take *that* long. How many allocas are you
dealing with?<br>
<br>
I do see an outer loop in SROA::performPromotion. Are
you possibly triggering a large number of iterations
there? That would make the inner loop appear much more
expensive. <br>
<div class=""> <br>
</div>
</div>
</blockquote>
<div><br>
</div>
<div>The unoptimized bitcode (compiled with clang -O0) has a
large number of allocas (over 7000), which I think is also
contributing to the long compilation time.</div>
</div>
</div>
</div>
</blockquote>
I suspect that many things in the optimizer aren't going to react
well to that many allocas. :) Where are they all coming from? And
why hasn't mem2reg removed most of them? I suspect you'd be better
off investing effort to address this issue then optimizing SROA. <br>
<blockquote cite="mid:CAB297QwaU1SB4ENicf6gTzV_Hkw5sJHPaG_wFHgTjL76rZoh8Q@mail.gmail.com" type="cite">
<div dir="ltr">
<div class="gmail_extra">
<div class="gmail_quote">
<div>
<br>
</div>
<div>But making LoadAndStorePromoter::run faster is
sufficient to reduce the compilation time (at least in my
case).</div>
</div>
</div>
</div>
</blockquote>
I'm not objecting to this. I'd be happy to see something along
these lines land, I'm just not sure it solves the root cause of your
issues. <br>
<blockquote cite="mid:CAB297QwaU1SB4ENicf6gTzV_Hkw5sJHPaG_wFHgTjL76rZoh8Q@mail.gmail.com" type="cite">
<div dir="ltr">
<div class="gmail_extra">
<div class="gmail_quote">
<div> ...<br>
</div>
</div>
</div>
</div>
</blockquote>
<blockquote cite="mid:CAB297QwaU1SB4ENicf6gTzV_Hkw5sJHPaG_wFHgTjL76rZoh8Q@mail.gmail.com" type="cite">
<div dir="ltr">
<div class="gmail_extra">
<div class="gmail_quote">
<div>
</div>
<div><br>
</div>
<blockquote class="gmail_quote" style="margin:0px 0px 0px
0.8ex;border-left-width:1px;border-left-color:rgb(204,204,204);border-left-style:solid;padding-left:1ex">
<div text="#000000" bgcolor="#FFFFFF">
<div class="">
<blockquote type="cite">
<div dir="ltr">
<div><div style="margin: 0px;">4. Change the definition
of class Instruction in IR/Instruction.h so
that determining whether one instruction
precedes another becomes more efficient. My
current implementation brings down the
compilation time to just 6s.</div><div style="margin: 0px;"><br>
</div><div style="margin: 0px;">I won't go into much
detail about the change I made, but it
basically replaces Instruction's pointer to
its parent BasicBlock with another data
structure. I haven't done much investigation
into how this will affect the common case.</div><div style="margin: 0px;"><br>
</div>
<div>This will also help speed up the execution
of code in other places. For example, these
two functions are called when GVN::processLoad
is called and iterate over a basic block's
instruction list to determine the order of two
(load and stores) instructions.:</div><div style="margin: 0px;"><br>
</div><div style="margin: 0px; font-size: 11px; font-family: Menlo;"><span style="font-family:arial;font-size:small">llvm::isPotentiallyReachable</span></div><div style="margin: 0px; font-size: 11px; font-family: Menlo;">
<br>
</div><div style="margin: 0px; font-size: 11px; font-family: Menlo;">
// Linear scan, start at 'A', see whether we
hit 'B' or the end first.</div><div style="margin: 0px; font-size: 11px; font-family: Menlo;">
for (BasicBlock::const_iterator I = A, E =
BB->end(); I != E; ++I) {</div><div style="margin: 0px; font-size: 11px; font-family: Menlo;">
if (&*I == B)</div><div style="margin: 0px; font-size: 11px; font-family: Menlo;">
return true;</div><div style="margin: 0px; font-size: 11px; font-family: Menlo;">
}</div>
<div><br>
</div>
<div>DominatorTree::dominates</div>
<div><br>
</div>
<div><div style="margin: 0px; font-size: 11px; font-family: Menlo;"> //
Loop through the basic block until we find
Def or User.</div><div style="margin: 0px; font-size: 11px; font-family: Menlo;">
BasicBlock::const_iterator I =
DefBB->begin();</div><div style="margin: 0px; font-size: 11px; font-family: Menlo;"> for
(; &*I != Def && &*I !=
User; ++I)</div><div style="margin: 0px; font-size: 11px; font-family: Menlo;">
/*empty*/;</div>
</div>
</div>
</div>
</blockquote>
</div>
I'd be curious to see a proposal here. I've been bitten
by the DT::dominates issue myself. <br>
<div class="">
<blockquote type="cite">
<div dir="ltr">
<div>
<div><br>
</div>
</div>
</div>
</blockquote>
</div>
</div>
</blockquote>
<div>I was hesitant to make changes in classes such as
BasicBlock and Instruction just to make it faster to
compile a basic block with 150K+ instructions (which I
think is a pathological case). If there are other use
cases that need discovering orders of instructions in a
basic block, it might be worth considering fixing those
classes.</div>
<div> </div>
<div>The attached patch is the current implementation I
have. Basically I am trying to make function
Instruction::precede fast and use it to speed up functions
that are currently iterating over the whole instruction
list to determine the order of instructions in a basic
block. <br>
</div>
</div>
</div>
</div>
</blockquote>
I get where you're going with this, but I don't think this is the
right approach. In particular, you're introducing extra
indirections along some fairly common paths and the caching scheme
around sublist is non-obvious. <br></div></blockquote><div><br></div>Akira, I think this is worth solving, but probably not for this specific issue. I don't think it's uncommon for optimizations to want to understand the relatively ordering of two instructions.</div><div><br></div><div>I don't claim to know the right solution. Philip, is your concern of Akira's approach about performance in the common case? Or is is about the complexity of the scheme? Akira, if would be good to have some analysis of the efficiency of various operations related to instructions. It's good to make the 'proceeds' query fast, but not at the expensive of insertion, deletion, etc.</div><div><br></div><div>Evan</div><div><br></div><div><br><blockquote type="cite"><div bgcolor="#FFFFFF" text="#000000">
<br>
I do like the extraction of the 'preceeds' interface. It gives us a
single place to optimize. <br>
<br>
Looking at the problem, I don't see an easy answer here. If we
weren't using linked lists, this would be easy, but with that
constraint, everything I come up with seems nasty. <br>
<br>
One somewhat less nasty (but still bad) scheme I came up with would
be to use the traits of the InstList in basic block to create a
listener which ignores all updates except when order is interesting
(i.e. inside the SROA code). This seems really bug prone though. I
am not in favour of actually implementing such a thing. <br>
<br>
Another scheme might be to have an Analysis which caches the answer
"does X proceed Y". By default, nothing would preserve this, and we
could teach certain passes to be aware of it. Again, this feels
dirty. <br>
<br>
I don't really have a good solution here. <br>
<br>
<blockquote cite="mid:CAB297QwaU1SB4ENicf6gTzV_Hkw5sJHPaG_wFHgTjL76rZoh8Q@mail.gmail.com" type="cite">
<div dir="ltr">
<div class="gmail_extra">
<div class="gmail_quote">
<div><br>
</div>
<blockquote class="gmail_quote" style="margin:0px 0px 0px
0.8ex;border-left-width:1px;border-left-color:rgb(204,204,204);border-left-style:solid;padding-left:1ex">
<div text="#000000" bgcolor="#FFFFFF">
<div class="">
<blockquote type="cite">
<div dir="ltr">
<div>
<div> </div><div style="margin: 0px;">5. Simply give up doing
SROA (and GVN::processLoad) if there is a huge
basic block that can potentially take a long
time to compile.</div>
</div>
</div>
</blockquote>
</div>
Until we have something better, this might be a
necessary hack. Whether it's actually worth introducing
depends on how long Chandler thinks the mem2reg version
will take to land. <br>
<span class=""><font color="#888888"> <br>
</font></span></div>
</blockquote>
<div><br>
</div>
<div>I tried this idea today, and unfortunately it didn't
reduce the compilation time. Skipping SROA leaves a lot of
loads/stores in the code that could have been promoted,
and that places the burden on other passes that are run
later, such as GVN and DSE.</div>
</div>
</div>
</div>
</blockquote>
Makes some sense. See my point above about lots of allocas
stressing the optimizer. :)<br>
<br>
p.s. Any chance of bugs being filed with test cases for each of
those? :)<br>
<br>
Philip<br>
</div>
_______________________________________________<br>LLVM Developers mailing list<br><a href="mailto:LLVMdev@cs.uiuc.edu">LLVMdev@cs.uiuc.edu</a> <a href="http://llvm.cs.uiuc.edu">http://llvm.cs.uiuc.edu</a><br><a href="http://lists.cs.uiuc.edu/mailman/listinfo/llvmdev">http://lists.cs.uiuc.edu/mailman/listinfo/llvmdev</a><br></blockquote></div><br></body></html>