[PATCH] D36134: [ARM] Improve loop unrolling for Cortex-M

Thu Aug 3 12:59:23 PDT 2017

efriedma added a comment.

Okay, I can try to expand on the "limitation of unrolling infrastructure" bit.  The question is, given that I have a small loop, and have an expression for the trip count of the loop, what's the optimal way to generate code for the loop?

When the iteration count is generally large, and the CPU doesn't have special hardware for looping quickly, we want to unroll a bunch of times to reduce the iteration overhead as much as possible, then generate a tiny remainder loop (where the performance doesn't really matter).
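To make the large-trip-count case concrete, here is a rough source-level sketch of what runtime unrolling with a remainder loop amounts to. It is an illustration only, not the IR the unroller actually emits: the function names, the summation loop, and the unroll factor of 4 are all assumptions picked for the example, and the real pass may place the remainder iterations before the unrolled body rather than after it.

```c
#include <stddef.h>

/* Hypothetical original loop: one compare-and-branch per element. */
int sum_simple(const int *a, size_t n) {
    int s = 0;
    for (size_t i = 0; i < n; ++i)
        s += a[i];
    return s;
}

/* Same loop, runtime-unrolled by an assumed factor of 4. */
int sum_unrolled4(const int *a, size_t n) {
    int s = 0;
    size_t i = 0;
    size_t main_end = n - (n % 4); /* largest multiple of 4 that is <= n */

    /* Unrolled body: one compare-and-branch per four elements. */
    for (; i < main_end; i += 4) {
        s += a[i];
        s += a[i + 1];
        s += a[i + 2];
        s += a[i + 3];
    }

    /* Remainder loop: runs at most 3 times. */
    for (; i < n; ++i)
        s += a[i];
    return s;
}
```

With a large n almost all the time goes to the unrolled body, so the cost of the remainder loop is negligible.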

Okay, but what happens when the iteration count is small, but the loop runs many times?  The runtime-unrolled version of the loop is completely useless (if the trip count is below the unroll factor, the unrolled body never executes at all), and we end up spending most of our time in the remainder loop.  So what can we do about that?  One option is to fully unroll the remainder loop: we know the maximum trip count, so it's a straightforward unroll operation.  Or maybe we could runtime-unroll the remainder loop (and generate a remainder loop for the remainder loop).  Or maybe we could try to do something fancy with a switch.  I'm not sure which option is best without actually testing it.  But there are definitely options here, and we can probably do better than just setting "DefaultUnrollRuntimeCount = 2" to dodge the issue.
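As a rough illustration of the "something fancy with a switch" idea, here is a source-level sketch in which the leftover iterations are handled by a fall-through switch rather than a loop. This is a hypothetical example, not a proposal for what the pass should emit: the function name, the summation loop, and the unroll factor of 4 are assumptions. With a factor of 4 the leftover count is at most 3, which is why the remainder can be written out without a back-edge.

```c
#include <stddef.h>

/* Hypothetical loop, unrolled by an assumed factor of 4, with the
   remainder handled by a switch instead of a remainder loop. */
int sum_unrolled4_switch(const int *a, size_t n) {
    int s = 0;
    size_t i = 0;
    size_t main_end = n - (n % 4); /* largest multiple of 4 that is <= n */

    /* Unrolled body, skipped entirely when n < 4. */
    for (; i < main_end; i += 4) {
        s += a[i];
        s += a[i + 1];
        s += a[i + 2];
        s += a[i + 3];
    }

    /* n - i is in the range 0..3 here; each case falls through to the
       next, so exactly n - i elements are added with no loop branch. */
    switch (n - i) {
    case 3:
        s += a[i + 2];
        /* fall through */
    case 2:
        s += a[i + 1];
        /* fall through */
    case 1:
        s += a[i];
        /* fall through */
    default:
        break;
    }
    return s;
}
```

For a call with n = 3, the unrolled loop is skipped and all the work is one indirect or compare-chain dispatch plus three adds. Whether that actually beats a short remainder loop on a given Cortex-M core is exactly the kind of thing that would need measuring.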

https://reviews.llvm.org/D36134