[llvm-commits] [llvm] r89187 - in /llvm/trunk: include/llvm/Target/TargetInstrInfo.h lib/CodeGen/BranchFolding.cpp lib/Target/ARM/ARMBaseInstrInfo.cpp lib/Target/ARM/ARMBaseInstrInfo.h lib/Target/ARM/ARMSubtarget.cpp lib/Target/ARM/ARMSubtarget.h

Chris Lattner sabre at nondot.org
Sat Nov 21 06:49:02 PST 2009


On Nov 19, 2009, at 10:52 PM, Bob Wilson wrote:
> On Nov 19, 2009, at 10:28 PM, Chris Lattner wrote:
>> 
>> I don't really buy this.  Are you really claiming that duplicating a 10K instruction basic block is worth it?  In reality there has to be a balance, even for ARM.  This is also likely to be a huge win for X86 but this is just like jump threading: while eliminating correlated branches is *always* a win from the dynamic instruction count perspective, we balance the benefit with the code size cost.  I don't see how this case is any different.
> 
> The size of the block is definitely limited -- the whole point of the target hook is to adjust that limit.  The aspect that we don't limit is the number of predecessors where we may duplicate that block.

Right ... so there is no limit on the total amount duplicated...

>> In practice, most jump table indirect gotos are preceded by a conditional branch that checks the "range" of the table anyway, so it won't matter.  However, if that weren't the case, this optimization would be just as useful for switches as for indbr's.  Ideally the same code *should* apply to both.
> 
> Not necessarily.  Our implementation of indirect branches artificially combines all the indirect branches in a function into a single branch.  That has a very bad effect on branch prediction.  The main reason we need to do this aggressive tail duplication for indirect branches is to essentially undo that transformation.

I don't see how that is related.  I'm not debating that this is important for indbr's. :)

I'm saying that it is just as important for switches in the uncommon case when they are structurally the same.  I'm arguing here that code structure is what matters, not the target ISA.
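To illustrate what I mean by code structure: a dense switch already ends in the same kind of indirect branch that an indirectbr does.  Here is a hand-lowered sketch (names and handlers invented for illustration, not code from the tree):

  static int handleAdd(int x)     { return x + 1; }
  static int handleSub(int x)     { return x - 1; }
  static int handleDefault(int x) { return x; }

  // Roughly what a dense switch becomes on most targets: a conditional
  // branch checking the range of the table, guarding an indirect jump
  // through the jump table.  The interesting part is the indirect branch
  // at the end, exactly as with an indirectbr.
  int loweredSwitch(unsigned op, int x) {
    static int (*const table[])(int) = { handleAdd, handleSub };
    if (op >= sizeof(table) / sizeof(table[0]))   // the "range" check
      return handleDefault(x);
    return table[op](x);                          // the indirect branch
  }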

>>>> Using the extant isIndirectBranch flag would be best, but even adding this sort of target hook would be somewhat ok.  At least this would be a property of the architecture.  If we can avoid it, I'd definitely prefer to of course.
>>> 
>>> The isIndirectBranch flag would not allow us to distinguish jump table branches.
>> 
>> I don't think we want to :).  Why do we want to?
> 
> Indirect branches (i.e., "computed gotos", not jump tables) are most often used for interpreters.

Uh, switches are commonly used by interpreters also.  The portable kind of interpreter :)
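To make the parallel concrete, here are the two interpreter styles we're talking about, as a rough sketch (opcodes and handlers invented for illustration; assumes the bytecode is well formed and ends in HALT):

  #include <cstdint>
  #include <vector>

  enum Op : std::uint8_t { ADD, SUB, HALT };

  // The portable style: one switch at the top of the loop.  The backend
  // typically lowers this to a range check plus a single indirect jump
  // through a jump table.
  int runSwitch(const std::vector<std::uint8_t> &code) {
    int acc = 0;
    for (std::size_t pc = 0; pc < code.size();) {
      switch (code[pc++]) {
      case ADD:  acc += 1; break;
      case SUB:  acc -= 1; break;
      case HALT: return acc;
      }
    }
    return acc;
  }

  // The computed-goto style (GNU labels-as-values extension): each handler
  // ends with its own "goto *", which the front end turns into indirectbr --
  // and which we currently funnel through a single indirect branch per
  // function.
  int runGoto(const std::vector<std::uint8_t> &code) {
    static void *table[] = { &&do_add, &&do_sub, &&do_halt };
    int acc = 0;
    std::size_t pc = 0;
    goto *table[code[pc++]];
  do_add:  acc += 1; goto *table[code[pc++]];
  do_sub:  acc -= 1; goto *table[code[pc++]];
  do_halt: return acc;
  }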

> Besides the "undo the front end's factoring of the CFG" motivation for treating indirect branches specially, there is more to it than that.  It is quite common for an interpreter to see common patterns in the operations it handles.  (This is especially true for certain benchmarks we care about.)

This is also very true of switch statements!

> The typical interpreter loop has a chunk of code to handle each operation, ending with an indirect branch to go to the next operation.  When there are patterns in the order of interpreted operations, those indirect branches become predictable -- but only if they are duplicated into the separate chunks of code for each operation.

Again, I don't see how this is any different for a switch in a loop vs an indirectbr in a loop.
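Both forms end up with the same shape: every handler branches back to one shared dispatch point, and that single indirect branch is what the predictor chokes on.  Rough sketch of that shared shape (same invented opcodes as above: 0 = add, 1 = sub, 2 = halt; assumes well-formed bytecode):

  #include <cstddef>
  #include <cstdint>
  #include <vector>

  // One dispatch point, reached from every handler -- the shape both the
  // switch-in-a-loop and the merged-indirectbr code take.  Tail duplication
  // copies the dispatch back into each handler, recovering the per-handler
  // indirect branches of runGoto above, which is why I think the
  // transformation is about this structure, not about which source
  // construct produced it.
  int runShared(const std::vector<std::uint8_t> &code) {
    static void *table[] = { &&do_add, &&do_sub, &&do_halt };  // GNU extension
    int acc = 0;
    std::size_t pc = 0;
  dispatch:
    goto *table[code[pc++]];  // the single, highly polymorphic indirect branch
  do_add:  acc += 1; goto dispatch;
  do_sub:  acc -= 1; goto dispatch;
  do_halt: return acc;
  }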

> Applying this intuition to the code size question above, if an interpreter loop handles 1000 different operations, we would still want to duplicate the indirect branches into every one of those 1000 chunks of code, as long as the code being duplicated is "small enough".  (I am thinking here of processors that can predict those branches and where the branch misprediction penalty is significant.  You would want to make a different tradeoff for a processor with no branch prediction.)

No, I don't see it this way.  What you're saying is that it is "worth it" to pay a code size cost to get a performance win in this case, because the win is high.  I am not debating this at all!

This is a completely acceptable tradeoff, I just want it factored the right way.  I'd be very happy if the tail duplication code said "if the block ends in an indirect goto, and if it is profitable to duplicate indirect gotos for the target, then increase the threshold a bit".  I don't like asking the target how much to increase the threshold.  "How much to increase the threshold" is not a target property.  "Is it profitable to duplicate indirect gotos" is.
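In other words, something shaped like this (signatures invented to illustrate the contrast, not the interface from this commit):

  // The factoring I don't like: the target hands back a magic number, so
  // every target grows its own copy of the tuning constant.
  struct TargetInstrInfoA {
    virtual ~TargetInstrInfoA() = default;
    virtual unsigned getTailDuplicateLimit(bool EndsInIndirectBr,
                                           unsigned DefaultLimit) const {
      (void)EndsInIndirectBr;
      return DefaultLimit;              // ARM would return DefaultLimit + 2
    }
  };

  // The factoring I'd prefer: the target answers a yes/no architectural
  // question, and the one "how much" constant stays in the tail
  // duplication pass.
  struct TargetInstrInfoB {
    virtual ~TargetInstrInfoB() = default;
    virtual bool isProfitableToDupIndirectBr() const { return false; }
  };

  unsigned computeTailDupLimit(const TargetInstrInfoB &TII,
                               bool EndsInIndirectBr, unsigned DefaultLimit) {
    if (EndsInIndirectBr && TII.isProfitableToDupIndirectBr())
      return DefaultLimit + 2;          // one arbitrary constant, one place
    return DefaultLimit;
  }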

>> And why is +2 the "right" amount?  Because it happens to be enough to get one particular testcase that you care about, or because of some fundamental property of the architecture?
> 
> It is for the same reason that -tail-merge-size defaults to "3".  ;-)
> 
> The default limit for tail duplication is "tail-merge-size" - 1.  That is also completely arbitrary.  We pick values that work well for the code we have measured and that we care about.  There's nothing fundamental about them.

Right.  So now instead of having one arbitrary place (in the tail dupe code) to change in the future, we have that one, plus one per target!

-Chris
