<table border="1" cellspacing="0" cellpadding="8">

    <tr>

        <th>Issue</th>

        <td>

            <a href=https://github.com/llvm/llvm-project/issues/98631>98631</a>

        </td>

    </tr>

    <tr>

        <th>Summary</th>

        <td>

            RegStackify can change instruction schedule and cause register spills at runtime

        </td>

    </tr>

    <tr>

      <th>Labels</th>

      <td>

            new issue

      </td>

    </tr>

    <tr>

      <th>Assignees</th>

      <td>

      </td>

    </tr>

    <tr>

      <th>Reporter</th>

      <td>

          yolanda15

      </td>

    </tr>

</table>

<pre>

    With more unrolling and ternary operations, it is possible to create a deep stack when generating Wasm code.

For example, the [dwconv kernel](https://github.com/google/XNNPACK/blob/master/src/f32-dwconv/gen/f32-dwconv-9p16c-minmax-fma3.c#L84) in XNNPACK computes 9 fmadds and accumulates them into a single value in the loop body. During the RegisterStackify phase, instructions are moved to create a deep stack when visiting the inputs of these fmadds. This changes the order chosen by the previous instruction scheduling pass and causes register spills when running with V8's optimizing compilers (Turboshaft or TurboFan) targeting the x64 platform. Because of compile-time constraints, V8 does not enable instruction scheduling by default, so it does not reorder the code to reduce register pressure at runtime.
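
To illustrate the effect, here is a toy Python model (not LLVM code) that counts how many values are live at once for a chain of fmadds under two orderings. It assumes each fmadd takes its two loads first and the running accumulator last, and that operands are pushed left to right. The function names are made up for this sketch.

```python
# Toy model only: this is not the RegStackify algorithm, just a count of
# simultaneously live values for a chain acc_k = fmadd(in_k, w_k, acc_{k-1}).

def scheduled_max_live(num_taps):
    """Interleaved order: load, load, fmadd, repeated once per tap."""
    live = 1                      # the running accumulator
    max_live = live
    for _ in range(num_taps):
        live += 2                 # load the input and the weight
        max_live = max(max_live, live)
        live -= 2                 # fmadd consumes 3 values and produces 1
    return max_live


def stackified_max_depth(num_taps):
    """Fully nested order: the whole chain becomes one expression tree.

    The loads of the outermost (last) fmadd are pushed first and stay on
    the stack until every inner fmadd has been evaluated.
    """
    depth = 0
    max_depth = 0
    for _ in range(num_taps):     # walk from the last fmadd down to the first
        depth += 2                # push this fmadd's two loads
        max_depth = max(max_depth, depth)
    depth += 1                    # push the initial accumulator
    max_depth = max(max_depth, depth)
    for _ in range(num_taps):     # unwind: each fmadd pops 3 and pushes 1
        depth -= 2
    return max_depth


if __name__ == "__main__":
    print(scheduled_max_live(9), stackified_max_depth(9))
```

Under these assumptions, 9 taps need 3 live values in the interleaved order but 19 stack slots when fully nested, which is more than the 16 XMM registers available on x64 without AVX-512.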

The [reg-stackify-dump.txt](https://github.com/user-attachments/files/16066680/reg-stackify-dump.txt) shows the dumped IR before and after RegStackify. You can see that the last load (%24), used by the last madd, is moved to become the first load after RegStackify.

Using memory operands can help mitigate the register pressure a bit at runtime, but two issues make it not a good option:

1. Even though it partially reduces register usage and results in fewer spills, it brings less than a 5% performance gain in the microbenchmark. In comparison, limiting the stack depth with a threshold brings a 2x speedup for this kernel.

2. There are cases where memory operands cannot be generated, since there may be stores between the loads and the instruction that uses them.

E.g. the dwconv kernel contains two stores in the loop. Both stores can grow into deep stacks. RegStackify starts from the 2nd store, but the loads used by the 2nd store (Loads2) cannot be moved after the 1st store, while the fused madd instructions can. Later the 1st store gets stackified with another deep stack, which moves instructions to right before the 1st store. These instructions are all placed after the earlier Loads2, which pushes Loads2 further away from the 2nd store. This means the live range of Loads2 spans the whole stack of the 1st store. I attached the dumped IR [reg-stackify-dwconv.txt](https://github.com/user-attachments/files/16085588/reg-stackify-dwconv.txt) after RegisterStackify for reference. At runtime, Loads2 cannot be folded into memory operands if they are used after the 1st store, because of the store's side effect.

Shall we add a threshold to stackification if the target system has a limited number of registers? In practice, we may create different Wasm kernels for different platforms to account for differences in cache size or register count. At the very least, we could add a flag to control the stackify depth. I have created a [PR](https://github.com/llvm/llvm-project/pull/97283) for this issue; it brings a 2x speedup for this dwconv kernel microbenchmark.
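
A rough sketch of what a depth threshold would do to the same kind of fmadd chain, written as a toy Python model and not taken from the PR. It assumes that once the nested tree would exceed the threshold, the chain is cut and the intermediate accumulator is passed on through a local. `stackify_with_threshold` and `max_depth` are hypothetical names used only here.

```python
# Toy model only: split a chain of fmadds into segments so that no single
# nested expression tree needs more than max_depth stack slots.

def stackify_with_threshold(num_taps, max_depth):
    """Return (peak stack depth, number of segments) for the split chain.

    A segment holding n fmadds needs 2 * n + 1 slots: two loads per fmadd
    plus the incoming accumulator.
    """
    if 3 > max_depth:
        raise ValueError("a single fmadd already needs 3 stack slots")
    taps_per_segment = (max_depth - 1) // 2
    segments = []
    remaining = num_taps
    while remaining > 0:
        n = min(taps_per_segment, remaining)
        segments.append(n)
        remaining -= n
    peak = max((2 * n + 1 for n in segments), default=0)
    return peak, len(segments)


if __name__ == "__main__":
    # 9 taps with a threshold of 7: three segments of 3 fmadds each.
    print(stackify_with_threshold(9, 7))
```

In this model a threshold of 7 keeps the peak depth at 7 at the cost of two extra trips through a local, while a threshold of 19 or more leaves the 9-tap chain as one tree.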

</pre>
