<table border="1" cellspacing="0" cellpadding="8">
<tr>
<th>Issue</th>
<td>
<a href="https://github.com/llvm/llvm-project/issues/63825">63825</a>
</td>
</tr>
<tr>
<th>Summary</th>
<td>
ARM: -Oz fails to unroll simple load/store loop, generates needlessly large code
</td>
</tr>
<tr>
<th>Labels</th>
<td>
new issue
</td>
</tr>
<tr>
<th>Assignees</th>
<td>
</td>
</tr>
<tr>
<th>Reporter</th>
<td>
johnstiles-google
</td>
</tr>
</table>
<pre>
I am running Clang 16 (not Apple's version) on an M1 Mac. I have some code which loads four scalars from memory, and then writes them out in vectorized form. Our project builds with -Oz because we want small code. However, Clang decides against unrolling the loop (for unknown reasons) and instead generates loops which actually need to copy values onto the stack.
Here is a Godbolt demonstration: [https://godbolt.org/z/r3ecGfqT9](https://godbolt.org/z/r3ecGfqT9)
Replacing -Oz with -O3 gives better code generation. Unrolling the loop allows all of that work to melt away, and we can do the loads and stores directly via registers. No stack traffic is needed.
Note that we can force the unroll to occur via a #pragma, which generates -O3 quality code. However, this approach doesn't scale well: I can't review all the loops in my code and find places to manually add pragmas. Demonstration: [https://godbolt.org/z/E38feYWPd](https://godbolt.org/z/E38feYWPd)
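For readers who can't open the Godbolt links, here is a rough sketch of the kind of loop involved and of the pragma workaround. It is a hypothetical reduction, not the exact code from the links: the names `Float4`, `UNROLL_LOOP` and `splat_four` are made up for illustration, and it assumes that "vectorized form" means each scalar is broadcast to all four lanes.

```cpp
// Hypothetical reduction; the real code is in the Godbolt links above.
// Float4 is a plain stand-in for a 4-wide vector type.
struct Float4 { float lane[4]; };

#if defined(__clang__)
#define UNROLL_LOOP _Pragma("unroll")  // Clang loop-unrolling hint
#else
#define UNROLL_LOOP                    // expands to nothing on other compilers
#endif

// Loads four scalars from src and writes each one out as a splatted vector.
// Assumption: src and dst each hold at least four elements.
void splat_four(const float* src, Float4* dst) {
    UNROLL_LOOP
    for (int i = 0; i != 4; ++i) {
        const float s = src[i];
        dst[i] = Float4{{s, s, s, s}};
    }
}
```

Removing `UNROLL_LOOP` gives the shape of the plain loop that I would expect -Oz to unroll on its own, since the trip count is a small compile-time constant.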
Also, it seems that further easy optimizations are possible. Clang is generating addresses using add instructions, but this is unnecessary. It could use repeated post-index immediate offsets to march the pointer forward in memory. This is apparently safe and does not incur a performance penalty on Mac ARM CPUs. I am told it has a performance penalty _only_ on the Cortex-A55, which is a CPU that has never been used in any Apple device.
(Going even further, Clang could use ld4r and stp in order to copy four registers at once and complete all the work in two ops; this even removes the need for post-index immediate offsets entirely. However, that seems like a stretch goal.)
</pre>