<table border="1" cellspacing="0" cellpadding="8">
<tr>
<th>Issue</th>
<td>
<a href=https://github.com/llvm/llvm-project/issues/144002>144002</a>
</td>
</tr>
<tr>
<th>Summary</th>
<td>
[LSR][flang][perf] Too much register pressure after LSR
</td>
</tr>
<tr>
<th>Labels</th>
<td>
flang
</td>
</tr>
<tr>
<th>Assignees</th>
<td>
</td>
</tr>
<tr>
<th>Reporter</th>
<td>
vzakhari
</td>
</tr>
</table>
<pre>
This issue is related to #143219: when flang adds `noalias` attributes to function arguments, CPU2017/548.exchange2 slows down by more than 10%, for example on skylake/icelake with `-Ofast -march=native`. The addition of `noalias` is enabled by `-mmlir -force-no-alias=true` and disabled by `-mmlir -force-no-alias=false` (this is the default in flang now).
Here is a significantly reduced test that still demonstrates the issue: [repro.f90.gz](https://github.com/user-attachments/files/20717573/repro.f90.gz)
LSR generates more loop-carried `lsr.iv` values, and GEP instructions like `%scevgep643 = getelementptr i8, ptr %lsr.iv642, i64 324` end up operating on stack spill slots, which puts too much pressure on memory.
```
$ flang -Ofast -march=skylake repro.f90 -mmlir -force-no-alias=false -mllvm -print-after=mergeicmps -mllvm -filter-print-funcs=_QMexchangePdigits_2 -mllvm -print-module-scope 2>&1 | grep "getelementptr.* 324" | wc -l
18
```
```
$ flang -Ofast -march=skylake repro.f90 -mmlir -force-no-alias=true -mllvm -print-after=mergeicmps -mllvm -filter-print-funcs=_QMexchangePdigits_2 -mllvm -print-module-scope 2>&1 | grep "getelementptr.* 324" | wc -l
24
```
Because LSR operates only on innermost loops, it optimizes the SCEV expressions separately for the main vector loop, the corresponding epilog vector loop and (sometimes) the corresponding scalar loop.
For example, flang generates a loop for the following array expression:
```
30 block(rnext:9, 8, i8) = block(rnext:9, 8, i8) - 10
```
The loop is then vectorized into the following (showing the `-force-no-alias=true` IR):
```
vector.body353: ; preds = %vector.body353, %vector.ph350
%index354 = phi i64 [ 0, %vector.ph350 ], [ %index.next359, %vector.body353 ]
%174 = add i64 1, %index354
%175 = getelementptr i32, ptr %gep251, i64 %174, !dbg !51
%176 = getelementptr i32, ptr %175, i32 0, !dbg !51
%177 = getelementptr i32, ptr %175, i32 8, !dbg !51
%178 = getelementptr i32, ptr %175, i32 16, !dbg !51
%179 = getelementptr i32, ptr %175, i32 24, !dbg !51
...
vec.epilog.vector.body370: ; preds = %vec.epilog.vector.body370, %vec.epilog.ph364
%index371 = phi i64 [ %vec.epilog.resume.val362, %vec.epilog.ph364 ], [ %index.next374, %vec.epilog.vector.body370 ]
%offset.idx372 = add i64 1, %index371
%187 = getelementptr i32, ptr %gep251, i64 %offset.idx372, !dbg !51
%188 = getelementptr i32, ptr %187, i32 0, !dbg !51
...
.lr.ph185: ; preds = %vec.epilog.scalar.ph363, %.lr.ph185
%191 = phi i64 [ %196, %.lr.ph185 ], [ %bc.resume.val377, %vec.epilog.scalar.ph363 ]
%192 = phi i64 [ %195, %.lr.ph185 ], [ %bc.resume.val378, %vec.epilog.scalar.ph363 ]
%gep183 = getelementptr i32, ptr %gep251, i64 %192, !dbg !51
```
After `loop-reduce` we end up with this:
```
vector.body353: ; preds = %vector.body353.preheader, %vector.body353
%index354 = phi i64 [ %index.next359, %vector.body353 ], [ 0, %vector.body353.preheader ]
%157 = shl i64 %index354, 2, !dbg !51
%scevgep649 = getelementptr i8, ptr %lsr.iv642, i64 %157, !dbg !51
%scevgep650 = getelementptr i8, ptr %scevgep649, i64 -96, !dbg !51
%158 = shl i64 %index354, 2, !dbg !51
%scevgep647 = getelementptr i8, ptr %lsr.iv642, i64 %158, !dbg !51
%scevgep648 = getelementptr i8, ptr %scevgep647, i64 -64, !dbg !51
%159 = shl i64 %index354, 2, !dbg !51
%scevgep645 = getelementptr i8, ptr %lsr.iv642, i64 %159, !dbg !51
%scevgep646 = getelementptr i8, ptr %scevgep645, i64 -32, !dbg !51
%160 = shl i64 %index354, 2, !dbg !51
%scevgep644 = getelementptr i8, ptr %lsr.iv642, i64 %160, !dbg !51
...
vec.epilog.vector.body370: ; preds = %vec.epilog.vector.body370, %vec.epilog.ph364
%index371 = phi i64 [ %vec.epilog.resume.val362, %vec.epilog.ph364 ], [ %index.next374, %vec.epilog.vector.body370 ]
%166 = shl i64 %index371, 2, !dbg !51
%scevgep654 = getelementptr i8, ptr %lsr.iv652, i64 %166, !dbg !51
```
So we have created `%lsr.iv642` and `%lsr.iv652` that are updated by the outer loop like this:
```
.lr.ph204: ; preds = %.lr.ph204.preheader, %222
...
%lsr.iv652 = phi ptr [ %scevgep651, %.lr.ph204.preheader ], [ %scevgep653, %222 ]
%lsr.iv642 = phi ptr [ %scevgep641, %.lr.ph204.preheader ], [ %scevgep643, %222 ]
...
222: ; preds = %221, %.loopexit, %.lr.ph204
%223 = add nsw i32 %.0202, -1, !dbg !71
%indvars.iv.next = add nsw i64 %indvars.iv, 1, !dbg !71
%scevgep643 = getelementptr i8, ptr %lsr.iv642, i64 324, !dbg !72
%scevgep653 = getelementptr i8, ptr %lsr.iv652, i64 324, !dbg !72
```
Here are the initial values of these ivs on entry to the outer loop:
```
%70 = add i64 %69, 20, !dbg !14
%scevgep641 = getelementptr i8, ptr @_QMexchangeEblock, i64 %70, !dbg !14
%71 = add i64 %69, -76, !dbg !14
%scevgep651 = getelementptr i8, ptr @_QMexchangeEblock, i64 %71, !dbg !14
```
They differ by a constant (20 vs. -76, i.e. 96 bytes), so it would probably be better to use a single loop-carried iv and have a local iv inside the outer loop that is computed as the loop-carried iv offset by a constant. Moreover, the constant offset can then probably be merged with the GEPs that use the local iv.
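To make the suggestion concrete, here is a hand-written sketch of the shape I have in mind. It is not compiler output: the function `@sketch`, its arguments and the stores are hypothetical stand-ins for the real loop nest. Only the 324-byte stride and the -96 offset (the difference between the -76 and 20 initial values above) are taken from the IR in this issue.
```
; Hypothetical example: one loop-carried pointer iv in the outer loop,
; with the second pointer derived locally instead of being carried.
define void @sketch(ptr %base, i64 %n) {
entry:
  br label %outer

outer:
  ; the only loop-carried pointer iv (plays the role of %lsr.iv642)
  %lsr.iv = phi ptr [ %base, %entry ], [ %lsr.iv.next, %outer ]
  %i = phi i64 [ 0, %entry ], [ %i.next, %outer ]
  ; local iv (plays the role of %lsr.iv652): recomputed on every outer
  ; iteration as carried iv + constant, so it needs no phi of its own
  %lsr.iv.local = getelementptr i8, ptr %lsr.iv, i64 -96
  ; stand-ins for the main vector loop and the epilog vector loop uses
  store i32 0, ptr %lsr.iv, align 4
  store i32 0, ptr %lsr.iv.local, align 4
  ; single 324-byte step per outer iteration
  %lsr.iv.next = getelementptr i8, ptr %lsr.iv, i64 324
  %i.next = add nuw i64 %i, 1
  %done = icmp eq i64 %i.next, %n
  br i1 %done, label %exit, label %outer

exit:
  ret void
}
```
With this shape only one pointer has to stay live across the outer-loop backedge, and the -96 can presumably be folded into the constant offsets of the GEPs that use the local iv.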
This is just one example from the attached reproducer; there are more innermost loops vectorized in this way, and all of them contribute to the increased register pressure. The innermost loops run for less than 10 iterations due to the benchmark setup, but I think this code can be improved.
There is a TODO in the source code about extending LSR to work on multiple loops - this is one option.
I guess there might be passes that can do a "post-cleanup" for such IVs, but I do not know for sure. Let me know if there is anything I can try.
Here is another LSR inefficiency affecting flang: #117318
</pre>