[llvm] r217144 - Enable noalias metadata by default and swap the order of the SLP and Loop vectorizers by default.
Arnold Schwaighofer
aschwaighofer at apple.com
Thu Sep 11 10:26:32 PDT 2014
To be clear, there are two points I think are worth inspecting:
* Increasing the cost of sub-register moves (insertelement, extractelement) on armv7; insertelement alone will do for this benchmark (a rough sketch follows below).
* Why is the cost of the scalar shift so high? I strongly suspect it is because we don’t support i64 scalar shifts, but I have not verified this.
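
For the first point, something along these lines is what I have in mind (untested sketch; the non-Swift cost of 2 is a made-up placeholder that would need measuring, and extending the penalty to extractelement is part of the suggestion, not something I have benchmarked):

unsigned ARMTTI::getVectorInstrCost(unsigned Opcode, Type *ValTy,
                                    unsigned Index) const {
  // Lane inserts and extracts of <= 32-bit elements are cross-class moves
  // between the core and NEON register files, which tend to be expensive on
  // ARMv7 implementations in general, not just on Swift. Penalize them so the
  // SLP vectorizer does not build trees that are fed and drained through
  // VMOVs.
  if ((Opcode == Instruction::InsertElement ||
       Opcode == Instruction::ExtractElement) &&
      ValTy->isVectorTy() && ValTy->getScalarSizeInBits() <= 32)
    return ST->isSwift() ? 3 : 2; // Non-Swift cost is a placeholder.

  return TargetTransformInfo::getVectorInstrCost(Opcode, ValTy, Index);
}
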
> On Sep 11, 2014, at 9:39 AM, Arnold Schwaighofer <aschwaighofer at apple.com> wrote:
>
>
>> On Sep 11, 2014, at 8:08 AM, James Molloy <James.Molloy at arm.com> wrote:
>>
>> Hi Louis,
>>
>> I’ve spent some time debugging this - looping Arnold in too because it
>> involves the SLP vectorizer, and Quentin because it involves cross-class
>> copies in the backend.
>>
>> The change itself is fairly innocuous. It does, however, permute the IR,
>> which seemingly means that the SLP vectorizer now decides to have a go at
>> vectorizing right in the middle of the hottest function, quantum_toffoli.
>>
>> It ends up putting cross-class “vmov”s right in the middle of the loop.
>> They’re invariant, and it’s obviously a terrible idea. AArch64 is not
>> affected because:
>> (a) it has a better cost model for arithmetic instructions, which allows
>> (b) the loop vectorizer to have a crack at unrolling by 4x, because
>> (c) there are more scalar registers available on A64.
>>
>> Incidentally, as part of this I’ve just discovered that libquantum is 50%
>> slower on A32 than on A64! :/
>
> Yes, you don’t want to get toffoli wrong if you care about libquantum :).
>
> getVectorInstrCost estimates the cost of insertelement and extractelement.
>
> On swift (armv7s) the penalty for insertelement saves us:
>
>
> unsigned ARMTTI::getVectorInstrCost(unsigned Opcode, Type *ValTy,
>                                     unsigned Index) const {
>   // Penalize inserting into a D-subregister. We end up with a three times
>   // lower estimated throughput on swift.
>   if (ST->isSwift() &&
>       Opcode == Instruction::InsertElement &&
>       ValTy->isVectorTy() &&
>       ValTy->getScalarSizeInBits() <= 32)
>     return 3;
>
>   return TargetTransformInfo::getVectorInstrCost(Opcode, ValTy, Index);
> }
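>
> (As far as I can tell, the SLP vectorizer queries this hook once per scalar
> it has to gather into, or extract out of, a vector, so the penalty feeds
> straight into the gather-bundle costs in the dumps below.)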
>
> armv7:
>
> SLP: Calculating cost for tree of size 4.
> SLP: Adding cost -6 for bundle that starts with %shl = shl i64 1, %sh_prom . // This estimate is surprising to me.
> SLP: Adding cost 0 for bundle that starts with i64 1 .
> SLP: Adding cost -1 for bundle that starts with %sh_prom = zext i32 %control1 to i64 .
> SLP: Adding cost 2 for bundle that starts with i32 %control1 .
> SLP: #LV: 0, Looking at %sh_prom = zext i32 %control1 to i64
> SLP: SpillCost=0
> SLP: Total Cost -1.
>
>
> armv7s:
>
> SLP: Calculating cost for tree of size 4.
> SLP: Adding cost -6 for bundle that starts with %shl = shl i64 1, %sh_prom .
> SLP: Adding cost 0 for bundle that starts with i64 1 .
> SLP: Adding cost -1 for bundle that starts with %sh_prom = zext i32 %control1 to i64 .
> SLP: Adding cost 6 for bundle that starts with i32 %control1 .
> SLP: #LV: 0, Looking at %sh_prom = zext i32 %control1 to i64
> SLP: SpillCost=0
> SLP: Total Cost 3.
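>
> If I read the dumps right, the only per-bundle difference is the gather of
> [%control1, %control2]: cost 2 on armv7 (two insertelements at the default
> cost of 1 each) versus cost 6 on armv7s (two insertelements at the Swift
> penalty of 3 each). That flips the total from -1, which is profitable, to 3,
> which is not, so the tree is only vectorized on plain armv7.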
>
>
> for.body.lr.ph: ; preds = %for.cond.preheader
> %node = getelementptr inbounds %struct.quantum_reg_struct* %reg, i32 0, i32 3
> %2 = load %struct.quantum_reg_node_struct** %node, align 4, !tbaa !10
> %sh_prom = zext i32 %control1 to i64 // + Cost of insertelement of [%control1, %control2] + Cost of zext of [%control1, %control2]
> %shl = shl i64 1, %sh_prom // + Cost of shl of [1, 1], [%sh_prom, %sh_prom8]
> %sh_prom8 = zext i32 %control2 to i64
> %shl9 = shl i64 1, %sh_prom8
> %sh_prom13 = zext i32 %target to i64
> %shl14 = shl i64 1, %sh_prom13
> br label %for.body
>
> for.body: ; preds = %for.body.lr.ph, %for.inc
> %i.034 = phi i32 [ 0, %for.body.lr.ph ], [ %inc, %for.inc ]
> %state = getelementptr inbounds %struct.quantum_reg_node_struct* %2, i32 %i.034, i32 1
> %3 = load i64* %state, align 4, !tbaa !11
> %4 = or i64 %shl, %shl9 // We seem to start the tree here and build it upwards.
> %5 = and i64 %3, %4
> %6 = icmp eq i64 %5, %4
> br i1 %6, label %if.then12, label %for.inc
>
>
> armv7:
>
> for.body.lr.ph: ; preds = %for.cond.preheader
> %node = getelementptr inbounds %struct.quantum_reg_struct* %reg, i32 0, i32 3
> %2 = load %struct.quantum_reg_node_struct** %node, align 4, !tbaa !10
> %3 = insertelement <2 x i32> undef, i32 %control1, i32 0
> %4 = insertelement <2 x i32> %3, i32 %control2, i32 1
> %5 = zext <2 x i32> %4 to <2 x i64>
> %6 = shl <2 x i64> <i64 1, i64 1>, %5
> %sh_prom13 = zext i32 %target to i64
> %shl14 = shl i64 1, %sh_prom13
> br label %for.body
>
> for.body: ; preds = %for.body.lr.ph, %for.inc
> %i.034 = phi i32 [ 0, %for.body.lr.ph ], [ %inc, %for.inc ]
> %state = getelementptr inbounds %struct.quantum_reg_node_struct* %2, i32 %i.034, i32 1
> %7 = load i64* %state, align 4, !tbaa !11
> %8 = extractelement <2 x i64> %6, i32 0
> %9 = extractelement <2 x i64> %6, i32 1
> %10 = or i64 %8, %9
> %11 = and i64 %7, %10
> %12 = icmp eq i64 %11, %10
> br i1 %12, label %if.then12, label %for.inc
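>
> Note that %6 is loop-invariant, but the two extractelements feeding the or
> are inside for.body, so after lowering they presumably become the
> cross-class vmovs James is seeing in the middle of the loop.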