[PATCH] D98512: [LoopVectorize] Simplify scalar cost calculation in getInstructionCost
Florian Hahn via Phabricator via llvm-commits
llvm-commits at lists.llvm.org
Mon Mar 15 03:25:40 PDT 2021
fhahn added a comment.
In D98512#2624996 <https://reviews.llvm.org/D98512#2624996>, @dmgreen wrote:
> I agree that the IV update in the test should have a cost of 1. (It seems strange it doesn't already, I suspect it isn't matched the same as an increment 1).
>
> Would it always be true for the addresses of scalarized loads/stores? They do need to calculate N addresses.
Hmm, but if there are uses of the IV in this case, don't we need to generate a scalar value for each element in the VF? E.g consider the example below
define void @test(half* %b, i32 %n) {
entry:
br label %for.body
for.body:
%i = phi i32 [ 0, %entry ], [ %i.next, %for.body ]
%tmp0 = getelementptr half, half* %b, i32 %i
%lv = load half, half* %tmp0
%add = fadd half %lv, 40.0
store half %add, half* %tmp0, align 1
%i.next = add nuw nsw i32 %i, 3
%cond = icmp eq i32 %i.next, %n
br i1 %cond, label %for.end, label %for.body
for.end:
ret void
}
Generates the vector body below with ` -S -force-vector-width=4 -mtriple=arm64-apple-iphoneos`. Note ` %5 = add i32 %offset.idx, 0`, `%6 = add i32 %offset.idx, 3`, `%7 = add i32 %offset.idx, 6` & ` %8 = add i32 %offset.idx, 9`. With this change, the cost of the induction increment is 1, but scalarization requires us to create 4 scalar increments. I don't think the cost of that is accounted elsewhere?
vector.body: ; preds = %vector.body, %vector.ph
%index = phi i32 [ 0, %vector.ph ], [ %index.next, %vector.body ]
%offset.idx = mul i32 %index, 3
%5 = add i32 %offset.idx, 0
%6 = add i32 %offset.idx, 3
%7 = add i32 %offset.idx, 6
%8 = add i32 %offset.idx, 9
%9 = getelementptr half, half* %b, i32 %5
%10 = getelementptr half, half* %b, i32 %6
%11 = getelementptr half, half* %b, i32 %7
%12 = getelementptr half, half* %b, i32 %8
%13 = getelementptr half, half* %9, i32 0
%14 = bitcast half* %13 to <12 x half>*
%wide.vec = load <12 x half>, <12 x half>* %14, align 2
%strided.vec = shufflevector <12 x half> %wide.vec, <12 x half> poison, <4 x i32> <i32 0, i32 3, i32 6, i32 9>
%15 = fadd <4 x half> %strided.vec, <half 0xH5100, half 0xH5100, half 0xH5100, half 0xH5100>
%16 = extractelement <4 x half> %15, i32 0
store half %16, half* %9, align 1
%17 = extractelement <4 x half> %15, i32 1
store half %17, half* %10, align 1
%18 = extractelement <4 x half> %15, i32 2
store half %18, half* %11, align 1
%19 = extractelement <4 x half> %15, i32 3
store half %19, half* %12, align 1
%index.next = add i32 %index, 4
%20 = icmp eq i32 %index.next, %n.vec
br i1 %20, label %middle.block, label %vector.body, !llvm.loop !0
================
Comment at: llvm/lib/Transforms/Vectorize/LoopVectorize.cpp:7377
SmallVector<const Value *, 4> Operands(I->operand_values());
- unsigned N = isScalarAfterVectorization(I, VF) ? VF.getKnownMinValue() : 1;
- return N * TTI.getArithmeticInstrCost(
- I->getOpcode(), VectorTy, CostKind,
- TargetTransformInfo::OK_AnyValue,
- Op2VK, TargetTransformInfo::OP_None, Op2VP, Operands, I);
+ return TTI.getArithmeticInstrCost(
+ I->getOpcode(), VectorTy, CostKind, TargetTransformInfo::OK_AnyValue,
----------------
I think conceptually there's nothing preventing us from reaching this code with an instruction that needs to be scalarized if the VF > 1. It might not occur in practice for some/most backends at the moment, because most of the opcodes above can be widened on most backends and we do not really properly track which instructions will be scalarized at this point, except for pointer & induction variables.
If we make the code less general, I think we should have at least an assertion to make sure we never reach this code when VF > 1 && scalarization is requested, so we at least notice once it really matters and we need to be more general again.
Repository:
rG LLVM Github Monorepo
CHANGES SINCE LAST ACTION
https://reviews.llvm.org/D98512/new/
https://reviews.llvm.org/D98512
More information about the llvm-commits
mailing list