[llvm] r217144 - Enable noalias metadata by default and swap the order of the SLP and Loop vectorizers by default.
Arnold Schwaighofer
aschwaighofer at apple.com
Thu Sep 11 10:26:32 PDT 2014
To be clear, there are two points I think are worth inspecting:
* Increasing the cost of sub-register moves (insertelement, extractelement) on armv7; insertelement alone will do for this benchmark (a rough sketch follows below).
* Why is the cost of the scalar shift so high? I strongly suspect it is because we don’t support i64 scalar shifts, but I have not verified this.
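
For the first point, something along these lines is what I have in mind (untested sketch; the non-Swift cost of 2 is a made-up placeholder that would need measuring, and extending the penalty to extractelement is part of the suggestion, not something I have benchmarked):

unsigned ARMTTI::getVectorInstrCost(unsigned Opcode, Type *ValTy,
                                    unsigned Index) const {
  // Lane inserts and extracts of <= 32-bit elements are cross-class moves
  // between the core and NEON register files, which tend to be expensive on
  // ARMv7 implementations in general, not just on Swift. Penalize them so the
  // SLP vectorizer does not build trees that are fed and drained through
  // VMOVs.
  if ((Opcode == Instruction::InsertElement ||
       Opcode == Instruction::ExtractElement) &&
      ValTy->isVectorTy() && ValTy->getScalarSizeInBits() <= 32)
    return ST->isSwift() ? 3 : 2; // Non-Swift cost is a placeholder.

  return TargetTransformInfo::getVectorInstrCost(Opcode, ValTy, Index);
}
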
> On Sep 11, 2014, at 9:39 AM, Arnold Schwaighofer <aschwaighofer at apple.com> wrote:
>
>
>> On Sep 11, 2014, at 8:08 AM, James Molloy <James.Molloy at arm.com> wrote:
>>
>> Hi Louis,
>>
>> I’ve spent some time debugging this - looping Arnold in too because it
>> involves the SLP vectorizer, and Quentin because it involves cross-class
>> copies in the backend.
>>
>> The change itself is fairly innocuous. It does, however, permute the IR,
>> which seemingly means that the SLP vectorizer now decides to have a go at
>> vectorizing right in the middle of the hottest function, quantum_toffoli.
>>
>> It ends up putting cross-class “vmov”s right in the middle of the loop.
>> They’re invariant, and it’s obviously a terrible idea. AArch64 is not
>> affected because:
>> (a) it has a better cost model for arithmetic instructions, which allows
>> (b) the loop vectorizer to have a crack at unrolling by 4x, because
>> (c) there are more scalar registers available on A64.
>>
>> Incidentally, as part of this I’ve just discovered that libquantum is 50%
>> slower on A32 than on A64! :/
>
> Yes, you don’t want to get toffoli wrong if you care about libquantum :).
>
> getVectorInstrCost estimates the cost of insertelement and extractelement.
>
> On swift (armv7s) the penalty for insertelement saves us:
>
>
> unsigned ARMTTI::getVectorInstrCost(unsigned Opcode, Type *ValTy,
>                                     unsigned Index) const {
>   // Penalize inserting into a D-subregister. We end up with a three times
>   // lower estimated throughput on swift.
>   if (ST->isSwift() &&
>       Opcode == Instruction::InsertElement &&
>       ValTy->isVectorTy() &&
>       ValTy->getScalarSizeInBits() <= 32)
>     return 3;
>
>   return TargetTransformInfo::getVectorInstrCost(Opcode, ValTy, Index);
> }
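>
> (As far as I can tell, the SLP vectorizer queries this hook once per scalar
> it has to gather into, or extract out of, a vector, so the penalty feeds
> straight into the gather-bundle costs in the dumps below.)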
>
> armv7:
>
> SLP: Calculating cost for tree of size 4.
> SLP: Adding cost -6 for bundle that starts with %shl = shl i64 1, %sh_prom . // This estimate is surprising to me.
> SLP: Adding cost 0 for bundle that starts with i64 1 .
> SLP: Adding cost -1 for bundle that starts with %sh_prom = zext i32 %control1 to i64 .
> SLP: Adding cost 2 for bundle that starts with i32 %control1 .
> SLP: #LV: 0, Looking at %sh_prom = zext i32 %control1 to i64
> SLP: SpillCost=0
> SLP: Total Cost -1.
>
>
> armv7s:
>
> SLP: Calculating cost for tree of size 4.
> SLP: Adding cost -6 for bundle that starts with %shl = shl i64 1, %sh_prom .
> SLP: Adding cost 0 for bundle that starts with i64 1 .
> SLP: Adding cost -1 for bundle that starts with %sh_prom = zext i32 %control1 to i64 .
> SLP: Adding cost 6 for bundle that starts with i32 %control1 .
> SLP: #LV: 0, Looking at %sh_prom = zext i32 %control1 to i64
> SLP: SpillCost=0
> SLP: Total Cost 3.
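>
> If I read the dumps right, the only per-bundle difference is the gather of
> [%control1, %control2]: cost 2 on armv7 (two insertelements at the default
> cost of 1 each) versus cost 6 on armv7s (two insertelements at the Swift
> penalty of 3 each). That flips the total from -1, which is profitable, to 3,
> which is not, so the tree is only vectorized on plain armv7.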
>
>
> for.body.lr.ph: ; preds = %for.cond.preheader
> %node = getelementptr inbounds %struct.quantum_reg_struct* %reg, i32 0, i32 3
> %2 = load %struct.quantum_reg_node_struct** %node, align 4, !tbaa !10
> %sh_prom = zext i32 %control1 to i64 // + Cost of insertelement of [%control1, %control2] + Cost of zext of [%control1, %control2]
> %shl = shl i64 1, %sh_prom // + Cost of shl of [1, 1], [%sh_prom, %sh_prom8]
> %sh_prom8 = zext i32 %control2 to i64
> %shl9 = shl i64 1, %sh_prom8
> %sh_prom13 = zext i32 %target to i64
> %shl14 = shl i64 1, %sh_prom13
> br label %for.body
>
> for.body: ; preds = %for.body.lr.ph, %for.inc
> %i.034 = phi i32 [ 0, %for.body.lr.ph ], [ %inc, %for.inc ]
> %state = getelementptr inbounds %struct.quantum_reg_node_struct* %2, i32 %i.034, i32 1
> %3 = load i64* %state, align 4, !tbaa !11
> %4 = or i64 %shl, %shl9 // We seem to start the tree here and build it upwards.
> %5 = and i64 %3, %4
> %6 = icmp eq i64 %5, %4
> br i1 %6, label %if.then12, label %for.inc
>
>
> armv7:
>
> for.body.lr.ph: ; preds = %for.cond.preheader
> %node = getelementptr inbounds %struct.quantum_reg_struct* %reg, i32 0, i32 3
> %2 = load %struct.quantum_reg_node_struct** %node, align 4, !tbaa !10
> %3 = insertelement <2 x i32> undef, i32 %control1, i32 0
> %4 = insertelement <2 x i32> %3, i32 %control2, i32 1
> %5 = zext <2 x i32> %4 to <2 x i64>
> %6 = shl <2 x i64> <i64 1, i64 1>, %5
> %sh_prom13 = zext i32 %target to i64
> %shl14 = shl i64 1, %sh_prom13
> br label %for.body
>
> for.body: ; preds = %for.body.lr.ph, %for.inc
> %i.034 = phi i32 [ 0, %for.body.lr.ph ], [ %inc, %for.inc ]
> %state = getelementptr inbounds %struct.quantum_reg_node_struct* %2, i32 %i.034, i32 1
> %7 = load i64* %state, align 4, !tbaa !11
> %8 = extractelement <2 x i64> %6, i32 0
> %9 = extractelement <2 x i64> %6, i32 1
> %10 = or i64 %8, %9
> %11 = and i64 %7, %10
> %12 = icmp eq i64 %11, %10
> br i1 %12, label %if.then12, label %for.inc
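>
> Note that %6 is loop-invariant, but the two extractelements feeding the or
> are inside for.body, so after lowering they presumably become the
> cross-class vmovs James is seeing in the middle of the loop.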