[llvm] [LoopUnroll] Clamp PartialThreshold for large LoopMicroOpBufferSize (PR #67657)

Wed Oct 4 07:33:50 PDT 2023

nikic wrote:

> I've always struggled with the idea of LoopMicroOpBufferSize in general. It appears to be just used as a way of throttling unrolls, to avoid the potential cost of unnecessary compare+branch.

I believe the primary purpose of LoopMicroOpBufferSize here is to make sure that we don't runtime-unroll past the loop buffer size. Doing so could mean that a (non-unrolled) loop that previously ran out of the loop buffer no longer does so after unrolling, which would be a clear pessimization.

So I think it should primarily serve as an upper bound on runtime unrolling. However, in practice we take it as the desired target, i.e. we try to unroll every loop we can up to the loop buffer size. I'm not sure this part really makes sense, at least for the kind of wide out-of-order cores we're talking about here (see also https://github.com/llvm/llvm-project/issues/42332#issuecomment-1207719729).
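
To make the "upper bound" versus "target" distinction concrete, here is a minimal standalone sketch. It is not LLVM code: the function names, and the idea that some other heuristic supplies `desiredCount`, are assumptions made purely for illustration.

```cpp
#include <algorithm>
#include <cassert>

// Hypothetical helper (not an LLVM API): the largest unroll count for which
// the unrolled body still fits in a loop buffer of bufferUops micro-ops.
// A count of 1 means "leave the loop as it is".
unsigned maxCountFittingBuffer(unsigned bodyUops, unsigned bufferUops) {
  assert(bodyUops > 0 && "loop body must contain at least one micro-op");
  return std::max(1u, bufferUops / bodyUops);
}

// Policy A: the buffer size is the target, so unroll as far as it allows.
unsigned countWithBufferAsTarget(unsigned bodyUops, unsigned bufferUops) {
  return maxCountFittingBuffer(bodyUops, bufferUops);
}

// Policy B: the buffer size only caps a count chosen by some other
// (here unspecified) heuristic.
unsigned countWithBufferAsBound(unsigned desiredCount, unsigned bodyUops,
                                unsigned bufferUops) {
  return std::min(std::max(1u, desiredCount),
                  maxCountFittingBuffer(bodyUops, bufferUops));
}
```

With a 10 micro-op body and an illustrative 64-entry buffer, policy A unrolls by 6, while policy B keeps a requested count of 2 and only trims a requested count of 8 down to 6.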

Possibly the more correct fix here would be to set LoopMicroOpBufferSize=0 for these subtargets and disable runtime unrolling entirely. My current proposed fix is just a very conservative starting point.
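
The two options in the paragraph above can be sketched as follows. This is a simplified model under stated assumptions, not the actual patch: `UnrollPrefsSketch`, `choosePrefs`, and the `clampLimit` parameter are hypothetical names, and the real preferences are set in LLVM's target hooks, which this does not reproduce.

```cpp
#include <algorithm>
#include <cassert>

// Hypothetical stand-in for the handful of unrolling preferences discussed
// here; the field names only loosely mirror the ones mentioned in the thread.
struct UnrollPrefsSketch {
  bool Partial = false;
  bool Runtime = false;
  unsigned PartialThreshold = 0;
};

// clampLimit is an assumed tuning knob; no particular value is implied.
UnrollPrefsSketch choosePrefs(unsigned loopMicroOpBufferSize,
                              unsigned clampLimit) {
  assert(clampLimit > 0 && "a zero clamp would make the threshold useless");
  UnrollPrefsSketch UP;
  // Option 1: a buffer size of 0 leaves partial and runtime unrolling off.
  if (loopMicroOpBufferSize == 0)
    return UP;
  // Option 2: keep unrolling enabled, but do not let a very large buffer
  // size translate directly into a very large threshold.
  UP.Partial = true;
  UP.Runtime = true;
  UP.PartialThreshold = std::min(loopMicroOpBufferSize, clampLimit);
  return UP;
}
```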

> In which case shouldn't we be trying to better estimate the cost of the loop control flow, based off branch distance etc? That way we avoid estimation issues for CPUs without a dedicated loopback buffer etc.

Can you explain in more detail how we would estimate the cost/benefit of runtime unrolling? It's not really clear what we can do here beyond the fact that the compare and branch go away, as we generally don't expect runtime unrolling to result in any simplification beyond that.
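
For reference, the only benefit named above can be written down as a back-of-the-envelope estimate. The function name and the micro-op accounting are assumptions for illustration; the sketch deliberately ignores the cost of the remainder loop that runtime unrolling also has to emit.

```cpp
#include <cassert>

// Hypothetical estimate: micro-ops saved per original iteration when a loop
// whose compare+branch costs controlUops is unrolled by `count`. The control
// overhead is then paid once per `count` iterations instead of every time.
double savedUopsPerIteration(unsigned controlUops, unsigned count) {
  assert(count >= 1 && "an unroll count of 1 means no unrolling");
  if (count == 1)
    return 0.0;
  return controlUops * (1.0 - 1.0 / count);
}
```

The saving approaches `controlUops` as the count grows but never exceeds it, so each doubling of the unroll count buys less than the previous one.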

https://github.com/llvm/llvm-project/pull/67657