[PATCH] D31175: Improve TargetTransformInfo::getCFInstrCost()

Thu Mar 23 07:41:48 PDT 2017

hfinkel added inline comments.

================
Comment at: lib/Transforms/Vectorize/LoopVectorize.cpp:7215-7223
+    // Let TTI model the extra cost in case of a branch before a predicated
+    // block, by passing VF in such cases.
+    unsigned PredicatedBBVF = 0;
+    BranchInst *BI = cast<BranchInst>(I);
+    if (VF > 1 && BI->isConditional() &&
+        (PredicatedBBsAfterVectorization.count(BI->getSuccessor(0)) ||
+         PredicatedBBsAfterVectorization.count(BI->getSuccessor(1))))
----------------
jonpa wrote:
> hfinkel wrote:
> > jonpa wrote:
> > > mssimpso wrote:
> > > > jonpa wrote:
> > > > > mssimpso wrote:
> > > > > > We may want to model the branch cost more carefully *inside the vectorizer* for the predication and if-conversion cases. It doesn't look like we model this at all yet.
> > > > > > 
> > > > > > I think there are 3 cases to consider: (1) the back-edge branch, (2) branches that are if-converted, and (3) branches that are unrolled/replicated due to predication. I think the current cost for Br is enough for (1). For (2) the branch goes away so the cost would be zero. But we should also think about the the cost of the selects introduced for if-conversion too at some point (there's a TODO for this with the PHI cost). For (3), I imagine we would, at least initially, model the replicated branches similar to the way we model scalarized instructions. Something like VF * TTI.getCFInstrCost().
> > > > > I think all these cases can be handled with the use of the new PredicatedBBVF variable, or?
> > > > > 
> > > > > 1) Back-edge branch: PredicatedBBVF==0
> > > > > 2) if-converted branches: PredicatedBBVF==0
> > > > > 3) Unrolled branches: PredicatedBBVF==VF.
> > > > > 
> > > > > So you are suggesting that getCFInstrCost(Instruction::Br) is called for the cost of a branch instruction, and that the LoopVectorizer should here instead decide when to call it, rather than passing PredicatedBBVF to it?
> > > > > 
> > > > > So you are suggesting that getCFInstrCost(Instruction::Br) is called for the cost of a branch instruction, and that the LoopVectorizer should here instead decide when to call it, rather than passing PredicatedBBVF to it?
> > > > 
> > > > That's essentially what I'm saying, yes. This seems fairly straightforward to model in a target-independent way inside the vectorizer without changing the TTI interface. It's also something the vectorizer has made no effort to model up until now.
> > > > 
> > > > I'm also saying that the costs for each of the 3 scenarios should be different. For example, for the back-edge case, the cost should be TTI.getCFInstrCost(). For the if-converted case (VF > 1), the cost should be zero (no TTI call), etc.
> > > Looking at the very first few sections in the summary above - this then puts me back to the question of how the target should model the extra costs of the extract+element compare before each unrolled branch?
> > > 
> > > 1. Either that cost is added to the branch as the current patch suggests, by assuming that each such branch for each scalarized part will incur these extra costs.
> > > 
> > > 2. The cost is added to the compare where it actually belongs. The downside is that even though a compare is the common case, there are also cases of e.g. a logical combination of two compares producing the boolean vector.
> > > 
> > > It seems to me that either way, there has to be a new parameter to either getCFInstrCost(), or getCmpSelInstrCost() that passes on the VF in the cases where a branch will be multiplied around predicated and scalarized blocks.
> > > 
> > > Unless of course, we could assume that the need for extract + element compare is generally true, so that in case 3 we could just add TTI costs for these two instruction times VF ?
> > > 
> > > The cost is added to the compare where it actually belongs. The downside is that even though a compare is the common case, there are also cases of e.g. a logical combination of two compares producing the boolean vector.
> > 
> > But isn't that a lowering decision for which the cost model should always account?
> > 
> > > It seems to me that either way, there has to be a new parameter to either getCFInstrCost(), or getCmpSelInstrCost() that passes on the VF in the cases where a branch will be multiplied around predicated and scalarized blocks.
> > 
> > Why? In what cases would that return different results from the current calls with vectorized types?
> > 
> > > Unless of course, we could assume that the need for extract + element compare is generally true, so that in case 3 we could just add TTI costs for these two instruction times VF ?
> > 
> > Can you elaborate on the cases where this would not be true. Maybe we need to something else in the case where the condition is loop invariant (but hasn't been unswitched)?
> > 
> > But isn't that a lowering decision for which the cost model should always account?
> I was thinking this to be somewhat rare, and also thinking that if this needed an extra parameter to getArithmeticInstrCost(), it might not be worth it.
> 
> > Why? In what cases would that return different results from the current calls with vectorized types?
> There is no type argument for getCFInstrCost(), but for the getCmpSelCost there is the vector operand type passed.
> 
> For the cost of a compare instruction, this would ordinarily mean the cost of a vector compare, producing a vector bitmask. On SystemZ, this bitmask is then if needed adjusted to the match the operands of the select instruction. 
> 
> In the case that the user is a branch, the cost becomes different, because the vector bitmask is not used in the same way, but rather with extract element; test-under-mask before each branch.
> 
> How is the getCmpSelInstrCost() supposed to know which case this belongs to? The vector operand type is the same, but only in the case of a branch are the extra costs also added.
> 
> Are you saying that if a vector type is passed to getCmpSelInstrCost(), along with the instruction pointer (which we don't do yet), it should then always be true that if the user of I is a branch, it is a branch around a block that is scalar and predicated with the same VF as the compare operands?
> 
> Or is it perhaps a possibility to use the second (currently unused) Type argument to pass a scalar Type and thus indicate that this is a vector compare that is going to be used elementwise in scalar contexts?
> 
> > Can you elaborate on the cases where this would not be true. Maybe we need to something else in the case where the condition is loop invariant (but hasn't been unswitched)?
> I don't know - I just thought there might be some target that could do better than this somehow. 
> 
>> But isn't that a lowering decision for which the cost model should always account?
> I was thinking this to be somewhat rare, and also thinking that if this needed an extra parameter to getArithmeticInstrCost(), it might not be worth it.

Let's worry about this when we have concrete examples.

> For the cost of a compare instruction, this would ordinarily mean the cost of a vector compare, producing a vector bitmask. On SystemZ, this bitmask is then if needed adjusted to the match the operands of the select instruction.

To summarize, on SystemZ it is actually cheaper to do the compare if it is being used by a branch (because you don't need to do the adjustment to match the operands). Interesting. I don't think we can use the "pass the instruction to check the users" in this case, because the compare might be used by a branch that the transformation intends to remove (if conversion or whatever), and we don't want to cost model to assume too much about the transformation asking for the costs. In this case, I recommend just adding a boolean parameter OnlyUsedByBranch = false, or something like that, so that you can model this reasonably.

>> Can you elaborate on the cases where this would not be true. Maybe we need to something else in the case where the condition is loop invariant (but hasn't been unswitched)?
> I don't know - I just thought there might be some target that could do better than this somehow.

Okay, let's just make the assumption for now and worry about potential exceptions when we find some.

https://reviews.llvm.org/D31175