[llvm] r217144 - Enable noalias metadata by default and swap the order of the SLP and Loop vectorizers by default.

Arnold Schwaighofer aschwaighofer at apple.com
Thu Sep 11 09:39:12 PDT 2014


> On Sep 11, 2014, at 8:08 AM, James Molloy <James.Molloy at arm.com> wrote:
> 
> Hi Louis,
> 
> I’ve spent some time debugging this - looping Arnold in too because it
> involves the SLP vectorizer, and Quentin because it involves cross-class
> copies in the backend.
> 
> The change itself is fairly innocuous. It does however permute IR which
> seemingly means that now, the SLP vectorizer decides to have a go at
> vectorizing right in the middle of the hottest function, quantum_toffoli.
> 
> It ends up putting cross-class “vmov”s right in the middle of the loop.
> They’re invariant, and it’s obviously a terrible idea. AArch64 is not
> affected because:
>  (a) it has a better cost model for arithmetic instructions, that allows
>  (b) the loop vectorizer to have a crack at unrolling by 4x, because
>  (c) there are more scalar registers available on A64.
> 
> Incidentally as part of this I’ve just discovered that libquantum is 50%
> slower on A32 than A64! :/

Yes, you don’t want to get toffoli wrong if you care about libquantum :).

getVectorInstrCost estimates the cost of insertelement and extractelement instructions.

On swift (armv7s), the penalty for insertelement saves us:


unsigned ARMTTI::getVectorInstrCost(unsigned Opcode, Type *ValTy,
                                    unsigned Index) const {
  // Penalize inserting into a D-subregister. We end up with a three times
  // lower estimated throughput on swift.
  if (ST->isSwift() &&
      Opcode == Instruction::InsertElement &&
      ValTy->isVectorTy() &&
      ValTy->getScalarSizeInBits() <= 32)
    return 3;

  return TargetTransformInfo::getVectorInstrCost(Opcode, ValTy, Index);
}

armv7:

SLP: Calculating cost for tree of size 4.
SLP: Adding cost -6 for bundle that starts with   %shl = shl i64 1, %sh_prom . // This estimate is surprising to me.
SLP: Adding cost 0 for bundle that starts with i64 1 .
SLP: Adding cost -1 for bundle that starts with   %sh_prom = zext i32 %control1 to i64 .
SLP: Adding cost 2 for bundle that starts with i32 %control1 .
SLP: #LV: 0, Looking at   %sh_prom = zext i32 %control1 to i64
SLP: SpillCost=0
SLP: Total Cost -1.


armv7s:

SLP: Calculating cost for tree of size 4.
SLP: Adding cost -6 for bundle that starts with   %shl = shl i64 1, %sh_prom .
SLP: Adding cost 0 for bundle that starts with i64 1 .
SLP: Adding cost -1 for bundle that starts with   %sh_prom = zext i32 %control1 to i64 .
SLP: Adding cost 6 for bundle that starts with i32 %control1 .
SLP: #LV: 0, Looking at   %sh_prom = zext i32 %control1 to i64
SLP: SpillCost=0
SLP: Total Cost 3.
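The SLP vectorizer only commits a tree when the total cost is negative enough. Paraphrasing the check in SLPVectorizer.cpp from memory (the exact form may differ slightly):

  int Cost = R.getTreeCost();
  // -slp-threshold defaults to 0, so any tree with negative cost wins.
  if (Cost < -SLPCostThreshold)
    R.vectorizeTree();

So the -1 tree on armv7 gets vectorized, while the +3 tree on armv7s is rejected.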


for.body.lr.ph:                                   ; preds = %for.cond.preheader
  %node = getelementptr inbounds %struct.quantum_reg_struct* %reg, i32 0, i32 3
  %2 = load %struct.quantum_reg_node_struct** %node, align 4, !tbaa !10
  %sh_prom = zext i32 %control1 to i64  // + Cost of insertelement of [%control1, %control2] + Cost of zext [%control1, %control2]
  %shl = shl i64 1, %sh_prom            // + Cost of shl of [1, 1], [%sh_prom, %sh_prom8]
  %sh_prom8 = zext i32 %control2 to i64
  %shl9 = shl i64 1, %sh_prom8
  %sh_prom13 = zext i32 %target to i64
  %shl14 = shl i64 1, %sh_prom13
  br label %for.body

for.body:                                         ; preds = %for.body.lr.ph, %for.inc
  %i.034 = phi i32 [ 0, %for.body.lr.ph ], [ %inc, %for.inc ]
  %state = getelementptr inbounds %struct.quantum_reg_node_struct* %2, i32 %i.034, i32 1
  %3 = load i64* %state, align 4, !tbaa !11
  %4 = or i64 %shl, %shl9 // We seem to start building the tree here and work upwards.
  %5 = and i64 %3, %4
  %6 = icmp eq i64 %5, %4
  br i1 %6, label %if.then12, label %for.inc


armv7:

for.body.lr.ph:                                   ; preds = %for.cond.preheader
  %node = getelementptr inbounds %struct.quantum_reg_struct* %reg, i32 0, i32 3
  %2 = load %struct.quantum_reg_node_struct** %node, align 4, !tbaa !10
  %3 = insertelement <2 x i32> undef, i32 %control1, i32 0
  %4 = insertelement <2 x i32> %3, i32 %control2, i32 1
  %5 = zext <2 x i32> %4 to <2 x i64>
  %6 = shl <2 x i64> <i64 1, i64 1>, %5
  %sh_prom13 = zext i32 %target to i64
  %shl14 = shl i64 1, %sh_prom13
  br label %for.body

for.body:                                         ; preds = %for.body.lr.ph, %for.inc
  %i.034 = phi i32 [ 0, %for.body.lr.ph ], [ %inc, %for.inc ]
  %state = getelementptr inbounds %struct.quantum_reg_node_struct* %2, i32 %i.034, i32 1
  %7 = load i64* %state, align 4, !tbaa !11
  %8 = extractelement <2 x i64> %6, i32 0
  %9 = extractelement <2 x i64> %6, i32 1
  %10 = or i64 %8, %9
  %11 = and i64 %7, %10
  %12 = icmp eq i64 %11, %10
  br i1 %12, label %if.then12, label %for.inc


> The answer is to fix the SLP vectorizer, but I don’t quite know where it
> needs fixing. Below is the code it is producing:
> 
> @ BB#4:                                 @ %.lr.ph
>        vmov.32 d16[0], r7     /// Costly
>        adr     r1, .LCPI1_0
>        mov     r2, #1
>        sub     r7, r5, #32
>        vld1.64 {d18, d19}, [r1:128]
>        rsb     r1, r5, #32
>        vmov.32 d16[1], r6     /// Costly
>        lsr     lr, r2, r1
>        ldr     r3, [r8, #12]
>        cmp     r7, #0
>        lslge   lr, r2, r7
>        lsl     r12, r2, r5
>        vmovl.u32       q8, d16
>        add     r3, r3, #8
>        mov     r7, #0
>        vshl.u64        q8, q9, q8
> .LBB1_5:                                @ =>This Inner Loop Header: Depth=1
>        vmov.32 r4, d17[1]     /// Very costly
>        vmov.32 r2, d16[1]     /// Very costly
>        ldr     r6, [r3]
>        ldr     r5, [r3, #4]
>        orr     r2, r2, r4
>        vmov.32 r4, d17[0]     /// Very costly
>        vmov.32 r1, d16[0]     /// Very costly
>        bic     r2, r2, r5
>        orr     r1, r1, r4
>        bic     r1, r1, r6
>        orrs    r1, r1, r2
>        bne     .LBB1_7
> 
> I suppose the question here is: is this the SLP vectorizer’s fault? It’s
> obviously producing cross-class moves, but these could be moved out of the
> loop at the cost of increasing register pressure. So I suspect there are
> two problems:
> 
>  (1) SLP vectorizer doesn’t know that cross-class copies are costly. How
> do we fix this? Where’s the hook to attach a penalty to an inserted
> “extractelement”?
>  (2) The backend doesn’t know that cross-class copies are costly, and
> that spills would be better (if we increased reg pressure enough to
> spill). How do we fix this? I know Quentin has done a bunch of work on
> cross-class copies - is there a hook missing for A32?
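For (1), the hook is the same getVectorInstrCost shown above (TargetTransformInfo, overridden per target). A purely hypothetical, untested sketch that also charges ExtractElement, so the in-loop lane reads get penalized too, could look like the following; the predicate and the cost value are my guesses and would need benchmarking:

unsigned ARMTTI::getVectorInstrCost(unsigned Opcode, Type *ValTy,
                                    unsigned Index) const {
  // Hypothetical: charge lane moves in both directions, since extracts
  // become cross-class vmovs (NEON -> core register) in the loop body.
  if ((Opcode == Instruction::InsertElement ||
       Opcode == Instruction::ExtractElement) &&
      ValTy->isVectorTy() &&
      ValTy->getScalarSizeInBits() <= 32)
    return 3;

  return TargetTransformInfo::getVectorInstrCost(Opcode, ValTy, Index);
}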
> 
> Cheers,
> 
> James
> 
> On 11/09/2014 08:20, "James Molloy" <James.Molloy at arm.com> wrote:
> 
>> Hi Louis,
>> 
>> That’s interesting. We certainly haven’t seen that kind of regression on
>> any benchmark, at least not on AArch64. I’m surprised it’s affecting ARM
>> more than AArch64 as this is a pure midend commit.
>> 
>> I have been narrowing down a regression caused by this commit of about 3%
>> in namd, which I think I’ve solved now. I’ll try and reproduce this on
>> AArch32 today and get to the bottom of it.
>> 
>> Cheers,
>> 
>> James
>> 
>> On 10/09/2014 23:32, "Louis Gerbarg" <lgg at apple.com> wrote:
>> 
>>> This patch seems to be causing some performance regressions. In
>>> particular, I am seeing 30-50% performance regressions for libquantum
>>> targeting armv7 at -Os and -O3 across several different devices. Do you
>>> see similar regressions? Any thoughts?
>>> 
>>> Thanks,
>>> Louis
>>> 
>>>> On Sep 4, 2014, at 6:23 AM, James Molloy <james.molloy at arm.com> wrote:
>>>> 
>>>> Author: jamesm
>>>> Date: Thu Sep  4 08:23:08 2014
>>>> New Revision: 217144
>>>> 
>>>> URL: http://llvm.org/viewvc/llvm-project?rev=217144&view=rev
>>>> Log:
>>>> Enable noalias metadata by default and swap the order of the SLP and
>>>> Loop vectorizers by default.
>>>> 
>>>> After some time maturing, hopefully the flags themselves will be
>>>> removed.
>>>> 
>>>> Modified:
>>>>   llvm/trunk/lib/Transforms/IPO/PassManagerBuilder.cpp
>>>>   llvm/trunk/lib/Transforms/Utils/InlineFunction.cpp
>>>> 
>>>> Modified: llvm/trunk/lib/Transforms/IPO/PassManagerBuilder.cpp
>>>> URL:
>>>> http://llvm.org/viewvc/llvm-project/llvm/trunk/lib/Transforms/IPO/PassManagerBuilder.cpp?rev=217144&r1=217143&r2=217144&view=diff
>>>> 
>>>> ==============================================================================
>>>> --- llvm/trunk/lib/Transforms/IPO/PassManagerBuilder.cpp (original)
>>>> +++ llvm/trunk/lib/Transforms/IPO/PassManagerBuilder.cpp Thu Sep  4
>>>> 08:23:08 2014
>>>> @@ -62,7 +62,7 @@ static cl::opt<bool> RunLoadCombine("com
>>>> 
>>>> static cl::opt<bool>
>>>> RunSLPAfterLoopVectorization("run-slp-after-loop-vectorization",
>>>> -  cl::init(false), cl::Hidden,
>>>> +  cl::init(true), cl::Hidden,
>>>>  cl::desc("Run the SLP vectorizer (and BB vectorizer) after the Loop "
>>>>           "vectorizer instead of before"));
>>>> 
>>>> 
>>>> Modified: llvm/trunk/lib/Transforms/Utils/InlineFunction.cpp
>>>> URL:
>>>> http://llvm.org/viewvc/llvm-project/llvm/trunk/lib/Transforms/Utils/InlineFunction.cpp?rev=217144&r1=217143&r2=217144&view=diff
>>>> 
>>>> ==============================================================================
>>>> --- llvm/trunk/lib/Transforms/Utils/InlineFunction.cpp (original)
>>>> +++ llvm/trunk/lib/Transforms/Utils/InlineFunction.cpp Thu Sep  4
>>>> 08:23:08 2014
>>>> @@ -42,7 +42,7 @@
>>>> using namespace llvm;
>>>> 
>>>> static cl::opt<bool>
>>>> -EnableNoAliasConversion("enable-noalias-to-md-conversion", cl::init(false),
>>>> +EnableNoAliasConversion("enable-noalias-to-md-conversion", cl::init(true),
>>>>  cl::Hidden,
>>>>  cl::desc("Convert noalias attributes to metadata during inlining."));
>>>> 
>>>> 
>>>> 
>>>> _______________________________________________
>>>> llvm-commits mailing list
>>>> llvm-commits at cs.uiuc.edu
>>>> http://lists.cs.uiuc.edu/mailman/listinfo/llvm-commits
>>> 
>>> 
>>> 
>> 
> 
> 




