<table border="1" cellspacing="0" cellpadding="8">
    <tr>
        <th>Issue</th>
        <td>
            <a href=https://github.com/llvm/llvm-project/issues/63833>63833</a>
        </td>
    </tr>

    <tr>
        <th>Summary</th>
        <td>
            ARM: -O3 avoids post-index immediate offset instructions unnecessarily
        </td>
    </tr>

    <tr>
      <th>Labels</th>
      <td>
            new issue
      </td>
    </tr>

    <tr>
      <th>Assignees</th>
      <td>
      </td>
    </tr>

    <tr>
      <th>Reporter</th>
      <td>
          johnstiles-google
      </td>
    </tr>
</table>

<pre>
Consider the following loop, which copies scalar data into vectors: https://godbolt.org/z/E38feYWPd

Clang is generating addresses using add instructions, but this is unnecessary. It could use repeated post-index immediate offsets to march the pointer forward in memory. This is apparently safe and does not incur a performance penalty on Mac ARM CPUs. I am told it has a performance penalty only on the Cortex A55, which is a CPU that has never been used in any Apple device. Even if it were slower, this would generate _smaller_ code, which is what -Oz is designed to do.

This approach would save two instructions:

        add     x8, x0, w1, uxtw
        add     x11, x0, x1, lsr #32
        ld1r    { v0.4s }, [x8], #4
        ld1r    { v1.4s }, [x8], #4
        ld1r    { v2.4s }, [x8], #4
        ld1r    { v3.4s }, [x8]
        stp     q0, q1, [x11]
        stp     q2, q3, [x11, #32]
        ret
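
For readers less familiar with post-index addressing, here is a minimal C sketch of what the load half of that sequence computes. It is not the source behind the Godbolt link. The names `vec4` and `splat_four` are hypothetical, and `unsigned int` is assumed to be 32 bits wide, as it is on AArch64. Each `*p++` models one `ld1r { vN.4s }, [x8], #4`: load a scalar, replicate it into all four lanes, and step the pointer forward by 4 bytes. The last load in the assembly has no post-increment, so the final `p++` here is simply unused.

```c
/* Hypothetical stand-in for a 128-bit vector of four 32-bit lanes. */
typedef struct { unsigned int lane[4]; } vec4;

/* Broadcast four consecutive scalars from src into dst[0..3]. */
void splat_four(const unsigned int *src, vec4 *dst) {
    const unsigned int *p = src;
    for (int i = 0; i != 4; ++i) {
        unsigned int s = *p++;          /* load, then advance 4 bytes (post-index) */
        for (int lane = 0; lane != 4; ++lane)
            dst[i].lane[lane] = s;      /* replicate into every lane, as ld1r does */
    }
}
```

The point of the sketch is that a single pointer is marched through memory, so no separate `add` is needed to form each address.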

For _even smaller_ code, Clang could use `ld4r` to load all four scalars at once. That would be three fewer instructions and would not need post-index offsets at all.
</pre>