[llvm] [ModuloSchedule] Implement modulo variable expansion for pipelining (PR #65609)

Tue Sep 26 06:40:36 PDT 2023

ytmukai wrote:

> Could you describe this a bit more (for example how are low trip counts handled?)

The pipelined code is selected only if the trip count is more than the number of iterations consumed by the prologue/epilogue and one iteration of the kernel. Otherwise, the original loop is selected. The original loop also handles the remainder iterations left over by the MVE unrolling.

The control flow is described in detail here:
https://github.com/llvm/llvm-project/blob/9f5dc31d11100d7c82dd6c86d054fea2f7b4138c/llvm/lib/CodeGen/ModuloSchedule.cpp#L2154-L2225
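As a rough illustration of the selection rule above, here is a minimal sketch. It is not the code from the patch: `LoopPlan`, `planLoop`, and the accounting (prologue/epilogue together cover `NumStages - 1` iterations, each kernel pass covers `NumUnroll` iterations) are assumptions made for this example, and the exact boundary condition (strict or non-strict comparison) should be taken from the linked comments rather than from this sketch.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical summary of how the iterations of one loop execution are split.
struct LoopPlan {
  bool UsePipelined;       // false: every iteration runs in the original loop
  uint64_t KernelPasses;   // passes through the unrolled kernel
  uint64_t RemainderIters; // iterations left for the original loop
};

// Assumed model: prologue and epilogue together account for NumStages - 1
// iterations, and one pass of the MVE-unrolled kernel accounts for NumUnroll.
LoopPlan planLoop(uint64_t TripCount, uint64_t NumStages, uint64_t NumUnroll) {
  assert(NumStages >= 1 && NumUnroll >= 1);
  const uint64_t Overlap = NumStages - 1;
  const uint64_t MinTrip = Overlap + NumUnroll;

  // Too few iterations for prologue/epilogue plus one kernel pass:
  // fall back to the original loop for everything.
  if (TripCount < MinTrip)
    return {false, 0, TripCount};

  // Otherwise run as many whole kernel passes as fit; whatever does not
  // fill a complete pass is the remainder handled by the original loop.
  const uint64_t InKernel = TripCount - Overlap;
  return {true, InKernel / NumUnroll, InKernel % NumUnroll};
}
```

For example, under this model `planLoop(9, 3, 2)` selects the pipelined code with three kernel passes and leaves one iteration for the original loop, while `planLoop(3, 3, 2)` runs all three iterations in the original loop.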

> I'm actually not sure if the upstream LLVM swing scheduler stitches the prolog/epilog schedule for low trip counts or whether it versions the loop.

The upstream implementation has no unrolling and also allows branching from the prologue to the epilogue, so any trip count is handled by the pipelined code alone.

If we adopted a branch from the prologue to the epilogue, we would not be able to schedule the instructions in those parts. Therefore, we did not use that approach in this implementation. Another reason is that the original loop is needed to handle the remainder of the unrolling anyway. (It might be possible to solve this problem by branching out to an epilogue from the middle of the kernel, but I believe that would require a large number of copies or multiple versions of the epilogue.)

> Could you comment on the cost model used to unroll per MVE? It should only be needed if there are copies, or an excessive number of copies.

Actually, this implementation unrolls enough times that no copies are required. A large unroll count has the disadvantages of a higher required trip count and increased code size. Since those disadvantages and the cost of copies are not directly comparable, I find it difficult to weigh the tradeoff.
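For reference, the textbook MVE rule for a copy-free unroll count is the maximum over all values of ceil(lifetime / II). The sketch below shows that rule only; `mveUnrollCount` and its lifetime input are hypothetical, and the patch may compute the count differently.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <vector>

// Textbook MVE rule (not necessarily what the patch does): a value that
// stays live for Lifetime cycles overlaps ceil(Lifetime / II) consecutive
// iterations, so it needs that many distinct registers. Unrolling the kernel
// by the maximum over all values gives each overlapping instance its own
// name, so no copies are needed.
uint64_t mveUnrollCount(const std::vector<uint64_t> &Lifetimes, uint64_t II) {
  assert(II > 0);
  uint64_t Unroll = 1; // an unroll count of 1 means "no unrolling"
  for (uint64_t Lifetime : Lifetimes)
    Unroll = std::max(Unroll, (Lifetime + II - 1) / II);
  return Unroll;
}
```

With lifetimes of 3, 7, and 2 cycles and II = 3, the value that lives for 7 cycles spans three iterations, so this rule gives an unroll count of 3.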

It may be a good idea to choose the minimum unroll count for which the ResMII, recalculated to include the resources consumed by the copies, does not exceed the scheduled II. It may also be necessary to check that the latency of the copies does not extend the critical path.
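A very simplified sketch of that idea follows. Everything in it is an assumption for illustration: a single resource class with `Units` identical units, `BaseOps` operations per iteration on that resource, one copy per extra register name a value needs, and the hypothetical helpers `copiesPerIteration` and `minUnrollWithinII`. It ignores the critical-path check mentioned above.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

static uint64_t ceilDiv(uint64_t A, uint64_t B) { return (A + B - 1) / B; }

// Assumed model: with the kernel unrolled Unroll times, a register is
// redefined every II * Unroll cycles. A value that lives longer than that
// needs ceil(Lifetime / Period) - 1 copies per original iteration.
uint64_t copiesPerIteration(const std::vector<uint64_t> &Lifetimes,
                            uint64_t II, uint64_t Unroll) {
  const uint64_t Period = II * Unroll;
  uint64_t Copies = 0;
  for (uint64_t Lifetime : Lifetimes) {
    const uint64_t Names = ceilDiv(Lifetime, Period);
    if (Names > 1)
      Copies += Names - 1;
  }
  return Copies;
}

// Smallest unroll count whose recomputed ResMII (copies included) still fits
// in the scheduled II. Returns MaxUnroll if no smaller count qualifies.
uint64_t minUnrollWithinII(const std::vector<uint64_t> &Lifetimes, uint64_t II,
                           uint64_t BaseOps, uint64_t Units,
                           uint64_t MaxUnroll) {
  assert(II > 0 && Units > 0 && MaxUnroll >= 1);
  for (uint64_t Unroll = 1; Unroll < MaxUnroll; ++Unroll) {
    const uint64_t Ops = BaseOps + copiesPerIteration(Lifetimes, II, Unroll);
    if (ceilDiv(Ops, Units) <= II)
      return Unroll;
  }
  return MaxUnroll;
}
```

With lifetimes of 3, 7, and 2 cycles, II = 3, five base operations, and two units, no unrolling needs two copies and pushes the ResMII to 4, while unrolling twice needs one copy and keeps the ResMII at 3. In this model an unroll count of 2 would suffice instead of the copy-free count of 3.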

This patch does not replace the existing process, but adds an entirely new process that uses MVE. We believe that the first step is to choose the appropriate method depending on the processor. For example, Intel processors seem to have an optimization that handles copies of floating-point registers at a lower cost in some cases. For processors without such optimizations, copy instructions often lead to an increase in II, so I think it may be better to unroll sufficiently.

https://github.com/llvm/llvm-project/pull/65609