[PATCH] D82709: [MachineLICM] [PowerPC] hoisting rematerializable cheap instructions based on register pressure.

Thu Aug 20 09:17:50 PDT 2020

shchenz added a comment.

Hi @efriedma, after spending a long time investigating greedy register allocation, I have some findings. I think the reason the rematerializable `lis` is not sunk by RA as we expected is a limitation of the current greedy register allocator. Hi @qcolombet, sorry to bother you. If my comments about greedyRA are wrong, please correct me. ^-^

After MachineLICM (with all `LIS` and some `ORI` hoisted), for the newly added testcase, we get:

  bb0:        ; outer loop preheader
      outteruse1 = 
      outteruse2 = 
      ....
      outteruseN = 
      ....
      lisvar1 = LIS
      orivar1 = ORI lisvar1
      lisvar2 = LIS
      orivar2 = ORI lisvar2
      ....
      lisvarm  = LIS
      orivarm = ORI lisvarm        <------ m ORI (together with related LIS) are hoisted out under register pressure.
      lisvarm+1 = LIS
      lisvarm+2 = LIS
      ...
      lisvarN = LIS        <------ all LIS are hoisted out because of remat.
  bb1:            ;  inner loop preheader
      MTCTR8loop    <-------hardware loop, set loop count
  bb2:
      std orivar1
      std orivar2
      ......
      std orivarm
      orivarm+1 = ORI lisvarm+1
      std orivarm+1
      orivarm+2 = ORI lisvarm+2
      std orivarm+2
      ......
      orivarN = ORI lisvarN
      std orivarN
      bdnz bb2   <--------hardware loop, test count register and branch
  bb3:
      std outteruse1
      ....
      std outteruseN
      conditional-branch bb1, bb4
  bb4:
    ret

In greedyRA, all live intervals are put into a priority queue, and live intervals with higher priority are assigned physical registers first. The larger a live interval is, the higher its priority. So in the code sequence above, `outteruse1`, ..., `outteruseN` are assigned physical registers earlier than the `lisvar` and `orivar` registers.
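
As a rough illustration of that ordering, here is a minimal Python sketch. It is a toy model, not the RAGreedy code: the function name `allocation_order` and the `(start, end)` instruction indices are made up for this example, and the real priority involves more than just size.

```python
import heapq

def allocation_order(intervals):
    """Return interval names in the order a size-prioritized queue pops them.

    `intervals` maps a virtual register name to a hypothetical (start, end)
    pair of instruction indices; size is simply end - start.
    """
    # heapq is a min-heap, so negate the size to pop the largest interval first.
    heap = [(-(end - start), name) for name, (start, end) in intervals.items()]
    heapq.heapify(heap)
    order = []
    while heap:
        _, name = heapq.heappop(heap)
        order.append(name)
    return order

# Hypothetical live ranges: the outer-loop values span the whole loop nest,
# while the LIS/ORI values are short.
intervals = {
    "outteruse1": (0, 100),
    "outteruse2": (2, 98),
    "lisvar1": (40, 41),
    "orivar1": (41, 60),
}
order = allocation_order(intervals)
print(order)
```

With these made-up ranges, the two `outteruse` intervals are dequeued before `orivar1` and `lisvar1`.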

So after the greedyRA stages `RS_Assign` and `RS_Split`, the `outteruseN` intervals are the first to enter the `RS_Spill` stage. The issue is that when it spills `outteruseN`, greedyRA does not first try to rematerialize the low-priority rematerializable `LIS` instructions. I think maybe this is why it is called greedy register allocation: it always handles live intervals one by one? After spilling `outteruseN`, the greedy register allocator marks the allocation for this live interval as done, and it is not changed later.

(When a rematerializable instruction itself needs to be spilled, it is rematerialized in front of its use as expected.)
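
To make the one-by-one behavior concrete, here is a small Python simulation. It is only a toy model under strong assumptions: every interval interferes with every other one, eviction just compares invented spill weights, and there is no splitting. The names `Interval` and `toy_greedy` and all sizes and weights are hypothetical; none of this is the RAGreedy implementation.

```python
import heapq
from dataclasses import dataclass

@dataclass
class Interval:
    name: str
    size: int      # hypothetical live-range length; larger is dequeued first
    weight: float  # hypothetical spill weight; higher is costlier to spill
    remat: bool    # True if the value can be recomputed at its use

def toy_greedy(intervals, num_regs):
    """Toy model: all intervals are assumed to interfere with each other."""
    by_name = {iv.name: iv for iv in intervals}
    queue = [(-iv.size, iv.name) for iv in intervals]
    heapq.heapify(queue)
    assigned, spilled, rematted = [], [], []
    while queue:
        _, name = heapq.heappop(queue)
        iv = by_name[name]
        if len(assigned) < num_regs:
            assigned.append(name)
            continue
        # Evict the cheapest assigned interval if it is cheaper than iv.
        victim = min(assigned, key=lambda n: by_name[n].weight)
        if by_name[victim].weight < iv.weight:
            assigned.remove(victim)
            assigned.append(name)
            heapq.heappush(queue, (-by_name[victim].size, victim))
            continue
        # No free register and nothing cheaper to evict: this decision is
        # final, even if remat candidates are still waiting in the queue.
        (rematted if iv.remat else spilled).append(name)
    return assigned, spilled, rematted

intervals = [
    Interval("outteruse1", 100, 1.0, False),
    Interval("outteruse2", 98, 1.0, False),
    Interval("lisvar1", 30, 5.0, True),
    Interval("lisvar2", 28, 5.0, True),
    Interval("lisvar3", 26, 5.0, True),
]
assigned, spilled, rematted = toy_greedy(intervals, num_regs=2)
print(assigned, spilled, rematted)
```

In this toy run with two registers, both `outteruse` values end up spilled and only `lisvar3` is rematerialized. If all three `LIS` values had been rematerialized instead, the two `outteruse` values would fit in the two registers with no spill at all.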

After greedy register allocation, the code sequence looks like:

  bb0:        ; outer loop preheader
      outteruse1 = 
      spill outteruse1 to stack.1   <------ spill; these spills could be avoided if we rematerialized all of the LIS below at their uses.
      outteruse2 = 
      spill outteruse2 to stack.2   <------ spill
      ....
      outteruseN = 
      spill outteruseN to stack.N  <------ spill
      ....
      lisvar1 = LIS
      orivar1 = ORI lisvar1
      lisvar2 = LIS
      orivar2 = ORI lisvar2
      ....
      lisvarm  = LIS
      orivarm = ORI lisvarm
      ...
      lisvarN = LIS        <------ not all of the remat LIS are rematerialized, because there is no need: the outteruse values are already spilled.
  bb1:            ;  inner loop preheader
      MTCTR8loop
  bb2:
      std orivar1
      std orivar2
      ......
      std orivarm
      lisvarm+1 = LIS     <------ rematerialized
      orivarm+1 = ORI lisvarm+1
      std orivarm+1
      lisvarm+2 = LIS       <------rematerialized
      orivarm+2 = ORI lisvarm+2
      std orivarm+2
      ......
      orivarN = ORI lisvarN
      std orivarN
      bdnz bb2
  bb3:
      reload outteruse1 from stack.1 <------reload
      std outteruse1
      ....
      reload outteruseN from stack.N  <------reload
      std outteruseN
      conditional-branch bb1, bb4
  bb4:
    ret

greedyRA cannot foresee that there are many rematerializable instructions with low priority in its queue when it spills a high-priority, non-rematerializable register. This appears to be a limitation of greedy register allocation. So I think the best approach may be for MachineLICM to hoist the `LIS` based on register pressure as well.
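
The idea of hoisting under a register-pressure limit can be sketched like this in Python. This is only an illustration of the policy, not MachineLICM's cost model: `choose_hoists`, `live_through` and `reg_limit` are hypothetical names, and each hoisted value is assumed to occupy exactly one register across the whole loop.

```python
def choose_hoists(candidates, live_through, reg_limit):
    """Hoist candidates in order while loop pressure stays within reg_limit.

    Toy model: `live_through` counts values already live across the loop,
    and every hoisted value adds one more.
    """
    pressure = live_through
    hoisted, kept_in_loop = [], []
    for name in candidates:
        if pressure + 1 <= reg_limit:
            hoisted.append(name)
            pressure += 1
        else:
            # Hoisting would exceed the limit; leave the cheap
            # instruction next to its use inside the loop.
            kept_in_loop.append(name)
    return hoisted, kept_in_loop

hoisted, kept = choose_hoists(
    ["lisvar1", "lisvar2", "lisvar3", "lisvar4"],
    live_through=6,
    reg_limit=8,
)
print(hoisted, kept)
```

With six values already live across the loop and a limit of eight, only the first two `LIS` are hoisted and the rest stay in the loop, so the allocator never has to choose between spilling an `outteruse` value and rematerializing a `LIS`.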

Sorry for the long comment, @efriedma. Your comments are very welcome. BTW: we found obvious improvements on some benchmarks with this change on the PowerPC target.

Repository:
  rG LLVM Github Monorepo

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D82709/new/

https://reviews.llvm.org/D82709