[PATCH] D31175: Improve TargetTransformInfo::getCFInstrCost()

Thu Mar 23 06:06:35 PDT 2017

hfinkel added inline comments.

================
Comment at: lib/Transforms/Vectorize/LoopVectorize.cpp:7215-7223
+    // Let TTI model the extra cost in case of a branch before a predicated
+    // block, by passing VF in such cases.
+    unsigned PredicatedBBVF = 0;
+    BranchInst *BI = cast<BranchInst>(I);
+    if (VF > 1 && BI->isConditional() &&
+        (PredicatedBBsAfterVectorization.count(BI->getSuccessor(0)) ||
+         PredicatedBBsAfterVectorization.count(BI->getSuccessor(1))))
----------------
jonpa wrote:
> mssimpso wrote:
> > jonpa wrote:
> > > mssimpso wrote:
> > > > We may want to model the branch cost more carefully *inside the vectorizer* for the predication and if-conversion cases. It doesn't look like we model this at all yet.
> > > > 
> > > > I think there are 3 cases to consider: (1) the back-edge branch, (2) branches that are if-converted, and (3) branches that are unrolled/replicated due to predication. I think the current cost for Br is enough for (1). For (2) the branch goes away so the cost would be zero. But we should also think about the the cost of the selects introduced for if-conversion too at some point (there's a TODO for this with the PHI cost). For (3), I imagine we would, at least initially, model the replicated branches similar to the way we model scalarized instructions. Something like VF * TTI.getCFInstrCost().
> > > I think all these cases can be handled with the use of the new PredicatedBBVF variable, or?
> > > 
> > > 1) Back-edge branch: PredicatedBBVF==0
> > > 2) if-converted branches: PredicatedBBVF==0
> > > 3) Unrolled branches: PredicatedBBVF==VF.
> > > 
> > > So you are suggesting that getCFInstrCost(Instruction::Br) is called for the cost of a branch instruction, and that the LoopVectorizer should here instead decide when to call it, rather than passing PredicatedBBVF to it?
> > > 
> > > So you are suggesting that getCFInstrCost(Instruction::Br) is called for the cost of a branch instruction, and that the LoopVectorizer should here instead decide when to call it, rather than passing PredicatedBBVF to it?
> > 
> > That's essentially what I'm saying, yes. This seems fairly straightforward to model in a target-independent way inside the vectorizer without changing the TTI interface. It's also something the vectorizer has made no effort to model up until now.
> > 
> > I'm also saying that the costs for each of the 3 scenarios should be different. For example, for the back-edge case, the cost should be TTI.getCFInstrCost(). For the if-converted case (VF > 1), the cost should be zero (no TTI call), etc.
> Looking at the very first few sections in the summary above - this then puts me back to the question of how the target should model the extra costs of the extract+element compare before each unrolled branch?
> 
> 1. Either that cost is added to the branch as the current patch suggests, by assuming that each such branch for each scalarized part will incur these extra costs.
> 
> 2. The cost is added to the compare where it actually belongs. The downside is that even though a compare is the common case, there are also cases of e.g. a logical combination of two compares producing the boolean vector.
> 
> It seems to me that either way, there has to be a new parameter to either getCFInstrCost(), or getCmpSelInstrCost() that passes on the VF in the cases where a branch will be multiplied around predicated and scalarized blocks.
> 
> Unless of course, we could assume that the need for extract + element compare is generally true, so that in case 3 we could just add TTI costs for these two instruction times VF ?
> 
> The cost is added to the compare where it actually belongs. The downside is that even though a compare is the common case, there are also cases of e.g. a logical combination of two compares producing the boolean vector.

But isn't that a lowering decision for which the cost model should always account?

> It seems to me that either way, there has to be a new parameter to either getCFInstrCost(), or getCmpSelInstrCost() that passes on the VF in the cases where a branch will be multiplied around predicated and scalarized blocks.

Why? In what cases would that return different results from the current calls with vectorized types?

> Unless of course, we could assume that the need for extract + element compare is generally true, so that in case 3 we could just add TTI costs for these two instruction times VF ?

Can you elaborate on the cases where this would not be true. Maybe we need to something else in the case where the condition is loop invariant (but hasn't been unswitched)?

https://reviews.llvm.org/D31175