<table border="1" cellspacing="0" cellpadding="8">
<tr>
<th>Issue</th>
<td>
<a href=https://github.com/llvm/llvm-project/issues/89937>89937</a>
</td>
</tr>
<tr>
<th>Summary</th>
<td>
x86 backend generates suboptimal loop alignment for Skylake and Sandy Bridge processors
</td>
</tr>
<tr>
<th>Labels</th>
<td>
new issue
</td>
</tr>
<tr>
<th>Assignees</th>
<td>
</td>
</tr>
<tr>
<th>Reporter</th>
<td>
lucic71
</td>
</tr>
</table>
<pre>
When compiling the LLVM IR file [test.ll](https://drive.google.com/file/d/1s_ERe3hP-OmoMrVrFELLkYAdfy2MPbWA/view?usp=sharing), we observed a considerable performance gain from changing the loop alignment that LLVM generates inside the function `test`. The default alignment generated by LLVM is 16 bytes (`p2align 4`), as shown in [test.S](https://drive.google.com/file/d/1qRDBV9KHpCZaogN7QI09cDMJDfkwWYI_/view?usp=sharing) at line 34.
On a Skylake machine, we measured the following runtime with the alignment generated by LLVM:
```
Elapsed time: 2.512872 seconds
```
After changing the `p2align 4` at line 34 of [test.S](https://drive.google.com/file/d/1qRDBV9KHpCZaogN7QI09cDMJDfkwWYI_/view?usp=sharing) to `p2align 5` (32-byte alignment), we got:
```
Elapsed time: 2.196517 seconds
```
That's a performance improvement of around 12%!
On another machine, this time Sandy Bridge, we saw the following numbers: 4.3s for p2align4 and 2.5s for p2align5.
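The relative runtime reductions implied by these timings can be checked with a few lines of Python. This is only a sketch of the arithmetic: `runtime_reduction` is a hypothetical helper name, and the Sandy Bridge inputs are the rounded figures quoted above.

```python
def runtime_reduction(before_s, after_s):
    """Fraction of the original runtime saved by the change."""
    return (before_s - after_s) / before_s

# Timings quoted in this report (seconds).
skylake = runtime_reduction(2.512872, 2.196517)
sandy_bridge = runtime_reduction(4.3, 2.5)

print(f"Skylake: {skylake:.1%}, Sandy Bridge: {sandy_bridge:.1%}")
```

With these inputs the reduction is about 12.6% on Skylake and about 41.9% on Sandy Bridge.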
The steps for reproducing the issue are:
```
$ llc -o test.S test.ll
$ clang -o test test.S
$ ./test
Elapsed time: 2.512872 seconds
$ # in test.S at line 34 replace `p2align 4` with `p2align 5`
$ clang -o test test.S
$ ./test
Elapsed time: 2.196517 seconds
$ clang --version
16.0.6
```
This is what the assembly looks like for the `test` function with p2align4:
```asm
00000000000011b0 <test>:
11b0: 48 85 f6 test %rsi,%rsi
11b3: 7e 13 jle 11c8 <test+0x18>
11b5: 48 8d 46 ff lea -0x1(%rsi),%rax
11b9: 39 54 b7 fc cmp %edx,-0x4(%rdi,%rsi,4)
11bd: 48 89 c6 mov %rax,%rsi
11c0: 75 ee jne 11b0 <test>
11c2: b8 01 00 00 00 mov $0x1,%eax
11c7: c3 ret
11c8: b8 ff ff ff ff mov $0xffffffff,%eax
11cd: c3 ret
```
And this is how it looks with p2align5:
```asm
00000000000011d0 <test>:
11d0: 66 2e 0f 1f 84 00 00 cs nopw 0x0(%rax,%rax,1)
11d7: 00 00 00
11da: 66 0f 1f 44 00 00 nopw 0x0(%rax,%rax,1)
11e0: 48 85 f6 test %rsi,%rsi
11e3: 7e 13 jle 11f8 <test+0x28>
11e5: 48 8d 46 ff lea -0x1(%rsi),%rax
11e9: 39 54 b7 fc cmp %edx,-0x4(%rdi,%rsi,4)
11ed: 48 89 c6 mov %rax,%rsi
11f0: 75 ee jne 11e0 <test+0x10>
11f2: b8 01 00 00 00 mov $0x1,%eax
11f7: c3 ret
11f8: b8 ff ff ff ff mov $0xffffffff,%eax
11fd: c3 ret
```
Notice that the loop size is 18 bytes in both cases (from the `test` instruction through the `jne` instruction); the only difference is the start address of the loop. With p2align4 the loop starts at 0x11b0 (16-byte aligned), and with p2align5 it starts at 0x11e0 (32-byte aligned).
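To make the difference concrete, the sketch below counts how many 32-byte aligned windows the loop body touches at each start address. It assumes that 32-byte windows are the relevant granularity, which this report does not establish on its own; `windows_touched` is a hypothetical helper name.

```python
def windows_touched(start, size, window=32):
    """Count the `window`-byte aligned blocks overlapped by [start, start + size)."""
    first = start // window
    last = (start + size - 1) // window
    return last - first + 1

# Loop size taken from the listings: the first byte after `jne` minus the loop start.
LOOP_SIZE = 0x11C2 - 0x11B0  # 18 bytes

print(windows_touched(0x11B0, LOOP_SIZE))  # p2align4 layout -> 2
print(windows_touched(0x11E0, LOOP_SIZE))  # p2align5 layout -> 1
```

At 0x11b0 the `jne` at 0x11c0 falls into the next 32-byte window, so the loop spans two windows. At 0x11e0 the whole loop fits in one.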
A [similar issue](https://lists.llvm.org/pipermail/llvm-dev/2021-January/148177.html) was raised a few years ago, in which Maxim Kazantsev suspected that the workload was decode-bound, so we gathered the same perf counters as he did:
```
p2align4:
3,181,514,079 idq.all_dsb_cycles_4_uops ( +- 0.21% ) (29.99%)
6,271,689,140 idq.all_dsb_cycles_any_uops ( +- 0.23% ) (30.08%)
6,306,733,259 idq.dsb_cycles ( +- 0.22% ) (30.08%)
16,393,899,425 idq.dsb_uops ( +- 0.12% ) (30.13%)
```
```
p2align5:
3,194,348,681 idq.all_dsb_cycles_4_uops ( +- 0.18% ) (29.93%)
3,367,347,174 idq.all_dsb_cycles_any_uops ( +- 0.15% ) (30.03%)
3,361,525,370 idq.dsb_cycles ( +- 0.14% ) (30.13%)
16,302,023,646 idq.dsb_uops ( +- 0.12% ) (30.15%)
```
Notice that p2align5 delivers slightly more 4-uop batches than p2align4 (about 3.19 vs. 3.18 billion cycles), while the number of cycles spent in the DSB drops by almost half (from about 6.3 to about 3.4 billion).
In the linked issue, Maxim proposed the following heuristic for addressing this problem:
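One way to read these counters is as the share of DSB-active cycles that deliver a full 4-uop batch. The sketch below does that arithmetic on the numbers above. The counters were multiplexed (each was sampled roughly 30% of the time), so the ratios are estimates, and `full_batch_share` is a hypothetical helper name.

```python
def full_batch_share(cycles_4_uops, cycles_any_uops):
    """Share of DSB-delivering cycles in which 4 uops were delivered."""
    return cycles_4_uops / cycles_any_uops

# idq.all_dsb_cycles_4_uops / idq.all_dsb_cycles_any_uops from the perf output above.
p2align4 = full_batch_share(3_181_514_079, 6_271_689_140)
p2align5 = full_batch_share(3_194_348_681, 3_367_347_174)

print(f"p2align4: {p2align4:.1%}, p2align5: {p2align5:.1%}")
```

By this measure, roughly half of the DSB-active cycles deliver 4 uops with p2align4, compared with roughly 95% with p2align5.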
```
Align loops by 32 if:
* They are innermost;
* Size of loop mod 32 is between 16 and 31 (only in this case alignment by 32 will strictly reduce the number of 32 window crossings by 1);
* (Optional) The loop is small, e.g. less than 32 bytes;
* (Optional) We could make even sharper checks trying to ensure that all other conditions of DSB max utilization are met (may be very complex!)
```
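The first three conditions of this proposal can be written as a small predicate. This is a sketch of the quoted heuristic only, not of any code in LLVM; the function name, its parameters, and the optional `max_size` cutoff are hypothetical.

```python
def should_align_loop_to_32(loop_size, is_innermost, max_size=None):
    """Sketch of the quoted heuristic: align a loop to 32 bytes only if it
    is innermost and its size mod 32 lies between 16 and 31 inclusive.
    `max_size` models the optional "loop is small" condition."""
    if not is_innermost:
        return False
    if max_size is not None and loop_size >= max_size:
        return False
    return 16 <= loop_size % 32 <= 31

# The 18-byte loop from this report, treated as innermost and small.
print(should_align_loop_to_32(18, is_innermost=True, max_size=32))
```

The 18-byte loop in `test` satisfies these conditions, since 18 mod 32 is 18.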
cc: @nunoplopes
</pre>