[PATCH] D34619: [ARM] Enable partial and runtime unrolling

Wed Jun 28 11:32:47 PDT 2017

efriedma added inline comments.

================
Comment at: lib/Target/ARM/ARMTargetTransformInfo.cpp:599
+  // Enable runtime unrolling for non-nested loops.
+  if (L->getLoopDepth() == 1) {
+    UP.Runtime = true;
----------------
samparker wrote:
> efriedma wrote:
> > These heuristics seem really weird... why does the loop depth affect the performance of a loop?  Why does the trip count of the parent loop affect the performance of a loop?
> > 
> > I mean, I can imagine these heuristics are profitable for your particular benchmarks, but it seems like a performance trap: for example, someone turns on LTO, so a function gets inlined into a loop, so we decide it's no longer profitable to unroll the inner loop, and we lose a bunch of performance.
> Hi Eli,
> 
> The depth check here is just for an early exit. For nested loops, I'm checking whether the trip count of the inner loop will be the same for each  iteration of the outer loops, it's not the actual trip count that affects performance.
> 
> I have found that its not profitable to unroll loops with calls because it can prevent the inlining from happening, so unrolling will not happen in those cases and I hope that the inlining does occur. Some inlining could push the cost above the unrolling threshold, but that doesn't matter because we're still getting the inlining performance gains. The thresholds have been chosen to unroll small loops where the overhead of the backedge, on m-class cores, is significant.
> 
> cheers,
> sam
> The depth check here is just for an early exit. For nested loops, I'm checking whether the trip count of the inner loop will be the same for each iteration of the outer loops,

This is still problematic.  You're basing the question of whether to unroll the loop based on code which is not part of the loop, which means the end result is going to be sensitive to unrelated inlining decisions.

---

Scanning the loop for calls seems fine.

================
Comment at: test/CodeGen/ARM/loop-unrolling.ll:152
+; CHECK-LABEL: nested_runtime_variant
+define arm_aapcs_vfpcc void @nested_runtime_variant(i32* nocapture %C, i16* nocapture readonly %A, i16* nocapture readonly %B, i32 %N) local_unnamed_addr #0 {
+entry:
----------------
Why do we want to avoid unrolling here?  At first glance, this looks like it should be profitable to unroll (more scheduler freedom, avoid branches).

https://reviews.llvm.org/D34619