[PATCH] D51780: ARM: align loops to 4 bytes on Cortex-M3 and Cortex-M4.

Mon Sep 10 06:51:43 PDT 2018

dmgreen added a comment.

> Interesting question. There are a couple of reasons to think the tradeoffs are different there. First, I'd probably (based purely on intuition rather than data) expect a loop to be executed more times than most functions. Second, when r7 is the FP, a function is almost guaranteed to start with a 16-bit instruction (`push {..., r7, lr}`).

The counterargument would be that in the case of function alignment, you don't need to execute the added NOP. For loops, there's always the small chance that you could be running an outer loop more than the inner one, leading to executed NOPs that each take a cycle (unless it's the M33 and you get lucky with its dual issue). If you don't care about the codesize, I think function alignment will always be a win because the extra padding is just in unused space.
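
To put rough numbers on the loop case, here is a toy cost model in Python. It is only a sketch under stated assumptions: the padding NOP is fallen through once per entry into the inner loop and costs one cycle, and alignment saves a fixed number of cycles per taken back-edge. The function name, parameters and numbers are made up for illustration; they are not measurements from any core.

```python
def loop_alignment_net_cycles(outer_iters, inner_iters, saved_per_backedge, nop_cycles=1):
    """Net cycles saved (positive) or lost (negative) by aligning an inner loop.

    Hypothetical model, not taken from the patch:
      - the padding NOP sits just before the loop header, so it is executed
        once per entry into the inner loop (once per outer iteration);
      - alignment only helps when branching back to the header, which
        happens (inner_iters - 1) times per entry.
    """
    nop_cost = outer_iters * nop_cycles
    backedges_per_entry = max(inner_iters - 1, 0)
    benefit = outer_iters * backedges_per_entry * saved_per_backedge
    return benefit - nop_cost


# Inner loop body runs once per outer iteration: the NOP is pure overhead.
print(loop_alignment_net_cycles(outer_iters=100, inner_iters=1, saved_per_backedge=1))

# Inner loop runs ten times per entry: the saving outweighs the NOP.
print(loop_alignment_net_cycles(outer_iters=100, inner_iters=10, saved_per_backedge=1))
```

Under these assumptions the first case loses 100 cycles and the second gains 800. For function alignment the same model has no NOP cost term at all, since the padding before the function entry is never executed.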

But let us know how the benchmarks look.

Repository:
  rL LLVM

https://reviews.llvm.org/D51780