<html>

<head>

<meta http-equiv="Content-Type" content="text/html; charset=utf-8">

</head>

<body style="word-wrap: break-word; -webkit-nbsp-mode: space; -webkit-line-break: after-white-space;" class="">

<p dir="ltr">Great, that explains it. I'll cobble together a patch for A15 tomorrow, but this is a microarchitectural effect, so I'm not sure I can fix it for compilations without -mcpu.</p>

<p dir="ltr">Cheers,</p>

<p dir="ltr">James</p>

<p dir="ltr">Sent from my Sony Xperia™ smartphone</p>

<br>

<br>

---- Louis Gerbarg wrote ----<br>

<br>

<div>The slowdown occurred executing code compiled for armv7 on both swift and cyclone. Checking quickly on the tester I have near me (cyclone) at -O3 I see no slowdown when compiled for arm64, and a minor (~3%) slowdown when compiled for armv7s.

<div class=""><br class="">

</div>

<div class="">Louis </div>

<div class=""><br class="">

<div>

<blockquote type="cite" class="">

<div class="">On Sep 11, 2014, at 10:33 AM, James Molloy <<a href="mailto:james.molloy@arm.com" class="">james.molloy@arm.com</a>> wrote:</div>

<br class="Apple-interchange-newline">

<div class="">


<div class="">

<p dir="ltr" class="">Hi,</p>

<p dir="ltr" class="">That makes total sense. Louis, were the targets you saw this on Swift or not? Arnold's explanation wouldn't account for a slowdown on Swift.

</p>

<p dir="ltr" class="">Cheers, <br class="">

James </p>


<br class="">

<br class="">

---- Arnold Schwaighofer wrote ----<br class="">

<br class="">

<font size="2" class="">

<div class="PlainText">To be clear, there are two points I think are worth inspecting:<br class="">

* Increasing the cost of subregister moves (insertelement, extractelement) on armv7. Raising the insertelement cost alone will do for this benchmark.<br class="">

* Why is the cost of the scalar shift so high? I strongly suspect it is because we don’t support i64 scalar shifts, but I have not verified this.<br class="">

<br class="">
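To illustrate why a scalar i64 shift could be costed high on a 32-bit target, here is a minimal sketch of a 64-bit left shift built from 32-bit halves.<br class="">
It is an illustration only: the names Halves, shl64ViaHalves and join are made up for this example, and this is not the sequence LLVM actually emits for armv7.<br class="">

```cpp
#include <cassert>
#include <cstdint>

// A 64-bit value held as two 32-bit words, as on a 32-bit register file.
struct Halves {
  uint32_t lo;
  uint32_t hi;
};

// Shift a 64-bit value left by a variable amount using only 32-bit operations.
Halves shl64ViaHalves(Halves v, unsigned amount) {
  assert(amount < 64 && "shift amount must be in [0, 63]");
  if (amount == 0)
    return v;
  if (amount >= 32) {
    // The low word moves entirely into the high word.
    return {0u, v.lo << (amount - 32)};
  }
  // 0 < amount < 32: bits shifted out of lo are carried into hi.
  uint32_t carry = v.lo >> (32 - amount);
  return {v.lo << amount, (v.hi << amount) | carry};
}

// Reassemble the halves so the result can be compared with a native shift.
uint64_t join(Halves h) {
  return (static_cast<uint64_t>(h.hi) << 32) | h.lo;
}
```

Even in this simplified form a single variable shift turns into several shifts, an OR and a case split on the amount, which is the kind of expansion that would make the scalar estimate large.<br class="">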

> On Sep 11, 2014, at 9:39 AM, Arnold Schwaighofer <<a href="mailto:aschwaighofer@apple.com" class="">aschwaighofer@apple.com</a>> wrote:<br class="">

> <br class="">

> <br class="">

>> On Sep 11, 2014, at 8:08 AM, James Molloy <<a href="mailto:James.Molloy@arm.com" class="">James.Molloy@arm.com</a>> wrote:<br class="">

>> <br class="">

>> Hi Louis,<br class="">

>> <br class="">

>> I’ve spent some time debugging this - looping Arnold in too because it<br class="">

>> involves the SLP vectorizer, and Quentin because it involves cross-class<br class="">

>> copies in the backend.<br class="">

>> <br class="">

>> The change itself is fairly innocuous. It does however permute IR which<br class="">

>> seemingly means that now, the SLP vectorizer decides to have a go at<br class="">

>> vectorizing right in the middle of the hottest function, quantum_toffoli.<br class="">

>> <br class="">

>> It ends up putting cross-class “vmov”s right in the middle of the loop.<br class="">

>> They’re invariant, and it’s obviously a terrible idea. AArch64 is not<br class="">

>> affected because:<br class="">

>> (a) it has a better cost model for arithmetic instructions, that allows<br class="">

>> (b) the loop vectorizer to have a crack at unrolling by 4x, because<br class="">

>> (c) there are more scalar registers available on A64.<br class="">

>> <br class="">

>> Incidentally as part of this I’ve just discovered that libquantum is 50%<br class="">

>> slower on A32 than A64! :/<br class="">

> <br class="">

> Yes, you don’t want to get toffoli wrong if you care about libquantum :).<br class="">

> <br class="">

> getVectorInstrCost estimates the cost for insertelement and extractelement.<br class="">

> <br class="">

> On swift (armv7s) the penalty for insertelement saves us:<br class="">

> <br class="">

> <br class="">

> unsigned ARMTTI::getVectorInstrCost(unsigned Opcode, Type *ValTy,<br class="">

>                                    unsigned Index) const {<br class="">

>  // Penalize inserting into a D-subregister. We end up with a three times<br class="">

>  // lower estimated throughput on swift.<br class="">

>  if (ST->isSwift() &&<br class="">

>      Opcode == Instruction::InsertElement &&<br class="">

>      ValTy->isVectorTy() &&<br class="">

>      ValTy->getScalarSizeInBits() <= 32)<br class="">

>    return 3;<br class="">

> <br class="">

>  return TargetTransformInfo::getVectorInstrCost(Opcode, ValTy, Index);<br class="">

> }<br class="">
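A standalone model of the hook above may help when thinking about extending the penalty to other cores.<br class="">
Everything here is hypothetical scaffolding: Opcode, Subtarget, the penalizeSubregInsert flag and the fallback cost of 1 are assumptions for this sketch, not LLVM's API. The fallback of 1 is chosen only because it is consistent with the insert bundle costing 2 on armv7 and 6 on armv7s for two inserts.<br class="">

```cpp
#include <cassert>

// Hypothetical stand-in for the LLVM opcodes involved.
enum class Opcode { InsertElement, ExtractElement };

// Hypothetical stand-in for the ARM subtarget queries.
struct Subtarget {
  bool isSwift;
  bool penalizeSubregInsert; // assumed flag for other cores, e.g. A15
};

// Assumed placeholder for the generic TargetTransformInfo fallback cost.
constexpr unsigned kBaseVectorInstrCost = 1;

// Mirrors the shape of the quoted hook: penalize narrow insertelement on
// cores where inserting into a D-subregister is slow.
unsigned vectorInstrCost(const Subtarget &st, Opcode op, bool isVector,
                         unsigned scalarBits) {
  assert(scalarBits > 0 && "element type must have a size");
  const bool penalized = st.isSwift || st.penalizeSubregInsert;
  if (penalized && op == Opcode::InsertElement && isVector && scalarBits <= 32)
    return 3;
  return kBaseVectorInstrCost;
}
```

With this model, two i32 inserts cost 2 on a plain armv7 subtarget and 6 on Swift, and extractelement is never penalized.<br class="">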

> <br class="">

> armv7:<br class="">

> <br class="">

> SLP: Calculating cost for tree of size 4.<br class="">

> SLP: Adding cost -6 for bundle that starts with   %shl = shl i64 1, %sh_prom . // This estimate is surprising to me.<br class="">

> SLP: Adding cost 0 for bundle that starts with i64 1 .<br class="">

> SLP: Adding cost -1 for bundle that starts with   %sh_prom = zext i32 %control1 to i64 .<br class="">

> SLP: Adding cost 2 for bundle that starts with i32 %control1 .<br class="">

> SLP: #LV: 0, Looking at   %sh_prom = zext i32 %control1 to i64<br class="">

> SLP: SpillCost=0<br class="">

> SLP: Total Cost -1.<br class="">

> <br class="">

> <br class="">

> armv7s:<br class="">

> <br class="">

> SLP: Calculating cost for tree of size 4.<br class="">

> SLP: Adding cost -6 for bundle that starts with   %shl = shl i64 1, %sh_prom .<br class="">

> SLP: Adding cost 0 for bundle that starts with i64 1 .<br class="">

> SLP: Adding cost -1 for bundle that starts with   %sh_prom = zext i32 %control1 to i64 .<br class="">

> SLP: Adding cost 6 for bundle that starts with i32 %control1 .<br class="">

> SLP: #LV: 0, Looking at   %sh_prom = zext i32 %control1 to i64<br class="">

> SLP: SpillCost=0<br class="">

> SLP: Total Cost 3.<br class="">
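The arithmetic behind the two logs above can be checked with a small sketch.<br class="">
The per-bundle costs do not add up to the reported totals on their own; the residual is the same in both logs. Attributing that residual to the extractelements needed for the scalar users, and assuming a tree is vectorized when its total cost is below a zero threshold, are assumptions of this sketch rather than something the logs state. The function names are made up.<br class="">

```cpp
#include <cassert>
#include <numeric>
#include <vector>

// Sum of the per-bundle costs printed as "SLP: Adding cost N ...".
int sumBundleCosts(const std::vector<int> &bundleCosts) {
  return std::accumulate(bundleCosts.begin(), bundleCosts.end(), 0);
}

// Whatever the reported total contains beyond the bundle and spill costs.
int residualCost(int reportedTotal, const std::vector<int> &bundleCosts,
                 int spillCost) {
  return reportedTotal - sumBundleCosts(bundleCosts) - spillCost;
}

// Assumption: the tree is vectorized when the total is below -threshold.
bool wouldVectorize(int totalCost, int threshold = 0) {
  assert(threshold >= 0 && "threshold is assumed to be non-negative");
  return totalCost < -threshold;
}
```

For armv7 the bundles {-6, 0, -1, 2} sum to -5 against a total of -1; for armv7s the bundles {-6, 0, -1, 6} sum to -1 against a total of 3. The residual is 4 in both cases, so the only difference between the targets is the insertelement bundle (2 versus 6), which is enough to flip the decision.<br class="">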

> <br class="">

> <br class="">

> for.body.lr.ph:                                   ; preds = %for.cond.preheader<br class="">

>  %node = getelementptr inbounds %struct.quantum_reg_struct* %reg, i32 0, i32 3<br class="">

>  %2 = load %struct.quantum_reg_node_struct** %node, align 4, !tbaa !10<br class="">

>  %sh_prom = zext i32 %control1 to i64  // + Cost of insert_element of [%control1, %control2] + Cost of zext [%control1, %control2]<br class="">

>  %shl = shl i64 1, %sh_prom            // + Cost of shl of [1, 1], [%sh_prom, %sh_prom8]<br class="">

>  %sh_prom8 = zext i32 %control2 to i64<br class="">

>  %shl9 = shl i64 1, %sh_prom8<br class="">

>  %sh_prom13 = zext i32 %target to i64<br class="">

>  %shl14 = shl i64 1, %sh_prom13<br class="">

>  br label %for.body<br class="">

> <br class="">

> for.body:                                         ; preds = %for.body.lr.ph, %for.inc<br class="">

>  %i.034 = phi i32 [ 0, %for.body.lr.ph ], [ %inc, %for.inc ]<br class="">

>  %state = getelementptr inbounds %struct.quantum_reg_node_struct* %2, i32 %i.034, i32 1<br class="">

>  %3 = load i64* %state, align 4, !tbaa !11<br class="">

>  %4 = or i64 %shl, %shl9 // We seemed to be starting the tree here upwards.<br class="">

>  %5 = and i64 %3, %4<br class="">

>  %6 = icmp eq i64 %5, %4<br class="">

>  br i1 %6, label %if.then12, label %for.inc<br class="">

> <br class="">

> <br class="">

> armv7:<br class="">

> <br class="">

> for.body.lr.ph:                                   ; preds = %for.cond.preheader<br class="">

>  %node = getelementptr inbounds %struct.quantum_reg_struct* %reg, i32 0, i32 3<br class="">

>  %2 = load %struct.quantum_reg_node_struct** %node, align 4, !tbaa !10<br class="">

>  %3 = insertelement <2 x i32> undef, i32 %control1, i32 0<br class="">

>  %4 = insertelement <2 x i32> %3, i32 %control2, i32 1<br class="">

>  %5 = zext <2 x i32> %4 to <2 x i64><br class="">

>  %6 = shl <2 x i64> <i64 1, i64 1>, %5<br class="">

>  %sh_prom13 = zext i32 %target to i64<br class="">

>  %shl14 = shl i64 1, %sh_prom13<br class="">

>  br label %for.body<br class="">

> <br class="">

> for.body:                                         ; preds = %for.body.lr.ph, %for.inc<br class="">

>  %i.034 = phi i32 [ 0, %for.body.lr.ph ], [ %inc, %for.inc ]<br class="">

>  %state = getelementptr inbounds %struct.quantum_reg_node_struct* %2, i32 %i.034, i32 1<br class="">

>  %7 = load i64* %state, align 4, !tbaa !11<br class="">

>  %8 = extractelement <2 x i64> %6, i32 0<br class="">

>  %9 = extractelement <2 x i64> %6, i32 1<br class="">

>  %10 = or i64 %8, %9<br class="">

>  %11 = and i64 %7, %10<br class="">

>  %12 = icmp eq i64 %11, %10<br class="">

>  br i1 %12, label %if.then12, label %for.inc<br class="">
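For readers who prefer source to IR, the loop condition above corresponds roughly to the following sketch.<br class="">
It is a hypothetical simplification: countMatching is a made-up name, the body of if.then12 is not shown in the IR and is replaced here by a counter, and the state array is flattened into a vector.<br class="">

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Counts states in which both control bits are set. The mask depends only on
// the two control indices, so it is loop-invariant and needs a single scalar
// 64-bit value inside the loop.
std::size_t countMatching(const std::vector<uint64_t> &states,
                          unsigned control1, unsigned control2) {
  assert(control1 < 64 && control2 < 64 && "bit index out of range");
  const uint64_t mask =
      (uint64_t{1} << control1) | (uint64_t{1} << control2);
  std::size_t matches = 0;
  for (uint64_t state : states) {
    if ((state & mask) == mask)
      ++matches;
  }
  return matches;
}
```

In the vectorized armv7 IR the two shifted values live in one vector register and are extracted again inside the loop before the OR, which is where the cross-class moves come from.<br class="">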

<br class="">

<br class="">

</div>

</font><br class="">

<font face="Arial" size="2" class="">-- IMPORTANT NOTICE: The contents of this email and any attachments are confidential and may also be privileged. If you are not the intended recipient, please notify the sender immediately and do not disclose the contents

 to any other person, use it for any purpose, or store or copy the information in any medium. Thank you.<br class="">

<br class="">

ARM Limited, Registered office 110 Fulbourn Road, Cambridge CB1 9NJ, Registered in England & Wales, Company No: 2557590<br class="">

ARM Holdings plc, Registered office 110 Fulbourn Road, Cambridge CB1 9NJ, Registered in England & Wales, Company No: 2548782<br class="">

</font></div>

_______________________________________________<br class="">

llvm-commits mailing list<br class="">

<a href="mailto:llvm-commits@cs.uiuc.edu" class="">llvm-commits@cs.uiuc.edu</a><br class="">

http://lists.cs.uiuc.edu/mailman/listinfo/llvm-commits<br class="">

</div>

</blockquote>

</div>

<br class="">

</div>

</div>

<br>


</body>

</html>