[PATCH] D62847: [PowerPC] reorder LSR and PPCCTRLoops pass

Wed Jun 5 02:17:55 PDT 2019

shchenz added a comment.

In D62847#1529703 <https://reviews.llvm.org/D62847#1529703>, @hfinkel wrote:

> I'm really uneasy about this change. PPCCTRLoops inserts special loop intrinsics which the rest of the loop-optimization pipeline doesn't understand, and so logically, it should come last. It's pretty clear that LSR is interfering with PPCCTRLoops, is this because it transforms the loops into some form that SCEV doesn't understand? Could we adjust LSR to make it more friendly to hardware-assisted loops? In theory, if LSR had the right cost model, it could actually make *more* loops that PPCCTRLoops could handle, and not fewer?

Thanks for your comment Hal @hfinkel . I create this patch mostly because of the suboptimal loop instructions in some testcases. Making more hardware loops is not this patch's original intention. Seems reorder these two passes can generate more hardware loop. One suboptimal loop instructions example is in file `test/CodeGen/PowerPC/lsr-ctrloop.ll`, we see that there is one redundant instruction `add 5, 3, 4` inside the loop. After investigate this issue, we found this is caused by `ICmpZero` `LSRUse`. After LSR gathers all formulaes for all LSRUse, for the case, all stores inside the loop is a user for both addrec `reg({0,+,384}` and `reg({(192 + %0),+,384})`, but `ICmpZero` `LSRUse` only has use for addrec `reg({0,+,384})`, so when LSR try to narrow down the search space, it selects addrec `reg({0,+,384})` as a winner since its usage number is bigger than `reg({(192 + %0),+,384})`. But the issue is: by using `reg({(192 + %0),+,384})`, we can get better code in LSR. You can see the difference in the testcase.

Yes, I tried to add a target hook function in `PPCTargetTransformationInfo` class, let's say `canConvertToHardwareLoop`. This function should be called in LSR through `TTI` and in PPCCTRLoops to check if a loop can be converted to hardware loop. And it should come from current condition check code in PPCCTRLoops. But there is some disadvantage about this solution:
1: The condition check code in PPCCTRLoops pass uses many Analysis Pass result, like `LoopInfo`, `SCEV`, `DominatorTree`, `AssumptionCache`, `DataLayout`, `TargetLibraryInfo`, `TargetTransformInfo` and so on, we need to pass all these infos from LSR/PPCCTRLoops to the target hook funciton in `PPCTargetTransformationInfo`. When I implemented like this way, I am a little worried this maybe a little 'heavy' for a target hook compared with other exist functions in `PPCTargetTransformationInfo`? Seems now https://reviews.llvm.org/D62604 implements like this, but at that time when I did this, it is not so straightforward for me.
2: Most important reason is, all above Analysis Pass results may be different at the point LSR calling `canConvertToHardwareLoop` and the point PPCCTRLoops calling `canConvertToHardwareLoop` , so it is possible that `canConvertToHardwareLoop` returns different value to LSR and PPCCTRLoops, then the ICmpZero is still not accurate and LSR still generates suboptimal instructions.

For the loop intrinsics generated before LSR, as I can see, LSR can still recognize it as a loop and I saw `PPCLoopPreIncPrep` can also transform the loop with intrinsics to a pre-inc loop when I tested it. So I guess LoopInfoWrapperPass can still recognize it as a Loop? Maybe you mean the loop trip count can not be got by the loop since it is now in loop preheader by calling the intrinsics? I am not sure about the impact since some impartant loop passes depending on loop trip count like loop unroll, loop vectorize are all before LSR and PPCCTRLoops. It would be great if you can give more detail about the impact?

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D62847/new/

https://reviews.llvm.org/D62847