<table border="1" cellspacing="0" cellpadding="8">
<tr>
<th>Issue</th>
<td>
<a href=https://github.com/llvm/llvm-project/issues/61659>61659</a>
</td>
</tr>
<tr>
<th>Summary</th>
<td>
[x86-64] Missed optimizations in wide multiplication: mov+add allocation, adc+add merge
</td>
</tr>
<tr>
<th>Labels</th>
<td>
new issue
</td>
</tr>
<tr>
<th>Assignees</th>
<td>
</td>
</tr>
<tr>
<th>Reporter</th>
<td>
rrrola
</td>
</tr>
</table>
<pre>
When compiling the following code for widening multiplication (128×128→256 bits) with `-O1` or better:
```C++
using Digit = unsigned __int128;
struct Big { Digit lo, hi; };
Big wide_mul(Digit a, Digit b) {
int s = 8 * sizeof(Digit) / 2;
Digit m = (Digit(1) << s) - Digit(1);
Digit r01 = (a >> s) * (b & m);
Digit r11 = (a >> s) * (b >> s) + (r01 >> s);
Digit r10 = (a & m) * (b >> s) + (r01 & m);
Digit r00 = (a & m) * (b & m);
r10 += r00 >> s;
r11 += r10 >> s;
return { (r10 << s) + (r00 & m), r11 };
}
```
the current `clang` generates the following assembly ([godbolt link](https://godbolt.org/z/Toas9Mcda)):
```assembly
wide_mul(unsigned __int128, unsigned __int128):
push rbx
mov r9, rdx
; Digit r01 = (a >> s) * (b & m);
mov rax, rcx
mul rdx
mov r10, rdx
mov r11, rax
; Digit r11 = (a >> s) * (b >> s) + (r01 >> s);
mov rax, r8
mul r9
mov r9, rdx
mov rbx, rax ; (1)
add rbx, r10 ; (1)
adc r9, 0
; Digit r10 = (a & m) * (b >> s) + (r01 & m);
mov rax, r8
mul rsi
mov r8, rdx
mov r10, rax ; (2)
add r10, r11 ; (2)
adc r8, 0
; Digit r00 = (a & m) * (b & m);
mov rax, rcx
mul rsi
; r10 += r00 >> s;
; r11 += r10 >> s;
add rdx, r10
adc r8, 0 ; (3)
add r8, rbx ; (3)
adc r9, 0
; return { (r10 << s) + (r00 & m), r11 };
mov [rdi], rax
mov [rdi + 8], rdx
mov [rdi + 16], r8
mov [rdi + 24], r9
mov rax, rdi
pop rbx
ret
```
After the lines marked (1), the values of `rax` and `r10` aren't used anymore, so the code doesn't have to allocate a new register and can reuse `r10`:
```assembly
mov rbx, rax ; (1)
add rbx, r10 ; (1)
-->
add r10, rax
```
Later uses of `rbx` should be replaced by `r10`. This saves a register overall – the function wouldn't need a `push ebx` / `pop ebx` anymore.
The same thing is possible in (2).
The `adc+add` instructions in (3) can be merged into a single `adc`:
```assembly
adc r8, 0
add r8, rbx
-->
adc r8, rbx
```
The final code with the changes:
```assembly
wide_mul(unsigned __int128, unsigned __int128):
mov r9, rdx
; Digit r01 = (a >> s) * (b & m);
mov rax, rcx
mul rdx
mov r10, rdx
mov r11, rax
; Digit r11 = (a >> s) * (b >> s) + (r01 >> s);
mov rax, r8
mul r9
mov r9, rdx
add r10, rax ; (1)
adc r9, 0
; Digit r10 = (a & m) * (b >> s) + (r01 & m);
mov rax, r8
mul rsi
mov r8, rdx
add r11, rax ; (2)
adc r8, 0
; Digit r00 = (a & m) * (b & m);
mov rax, rcx
mul rsi
; r10 += r00 >> s;
; r11 += r10 >> s;
add rdx, r11 ; was "add rdx, r10"
adc r8, r10 ; (3), was "adc r8, 0 | add r8, rbx"
adc r9, 0
; return { (r10 << s) + (r00 & m), r11 };
mov [rdi], rax
mov [rdi + 8], rdx
mov [rdi + 16], r8
mov [rdi + 24], r9
mov rax, rdi
ret
```
When compiling widening multiplication for a 32-bit x86 target `-O3 -m32` with 64-bit digits (`using Digit = unsigned long long`), the same optimizations are missed. (1) allows to save a register.
When compiling for an architecture that supports `mulx`, register allocation is not a problem anymore, but the `adc+add` chain in (3) is still missed.
</pre>
<img width="1px" height="1px" alt="" src="http://email.email.llvm.org/o/eJzsWE2P4jwS_jXmUgIlThOSA4cGhtOO9jLSHkdOYhLvODaynW56fv2q7ADhIzC9O1q90jutFt2xn_p0VfkhzFpRK86XZL4i882Eda7RZmmM0ZJNCl19LP_VcAWlbvdCClWDazjstJT6HZ9KXeGjgXdRcYUrbSed2EtRMie0AkKzmGZknZB8gf98oSRLSU7pPIVCOEtoDu_CNUDSaPrPmKQRaAMFd44bkrySaEOiV5JG4XdN6Ap__Wpn0d5G1MIBSTbQKR9LBd-_C-XQWNIjrTNd6WAlaiCLVS8iNaFraARJVkAWmxM4fCIWY_redpLQLIgwlAj_Fug4WfQyAABCObDekQwIfQUrfnK9O8p6ON0CPdlBmaCr9VJnZBZ7dLImyRp8hqYw3Lpy9azIRPFRFQOSfCHJlyCP_hCaFUBoCu1QxUA4fip8sbjCxWDxtHxXbTRQ25t_rvOBn9FjhfckvRN0hWJBvDd6iYlPmHgMw11nlC8idNTDzqd08j86e0HXQfGwwBabq7IOj9hZZWcMVw67oZRM1dgPNVfcMMftVe8xa3lbyA-0SearWleFlg6kUD_IfENo1ji3t9hEdEvott-faVMTuv1J6PabZjb_WlbMO5rfttvRQlgetMNtq9H1nf4b6IT-Z9_ZxmeyOFxutPrN_zUhZ9VhWOLYo_9TiV9YYAdvorz2oJNhvxpzLY4ufLvdj_0-G_P9d3fYvaiykaDyX0r3zXZx6EPCJz8q-xl0AWdVdQGPoyfwcmA8GsvWbxscn8mTFSNy2ZOzj24TRccT1cPj-Am8HBgfS9R_MRA_1Q_HlJwNP5unAfN8nl6kpDrWzsMchMU-Y8l4gsNxFYcn6IeF-FvG_XW6yXxlKuEn9GBW3Ed5G9kRO1Z7A2ycHsHX9X2LpS9H7Nhs6AujumqJvd6H_esZbri7e7OFz9ed48bfYVIobqFl5gevThOCrv3eG5Mdt6B3eAeiB2kETFX-KY78k-GK0IWDzvIKmPpoteEobrXX4DlppbkNqIa9cXAamJS6ZI4DA8XfwfBaWHQIlZdMgeGd5Wczz27Dm2QNJuWn5uQNeDrFXnk4Nk43zGWe_8EwoM6e81f4_NlGd7KCgoPhe8lKXkHxcQ51Bt8aYcGyN26BnTOj37hhUoIn7RHJk0BAOlV6av-OSkOOFceTQI3-fufBLPJdXNL740p_WLNhWXxrOFjWcnANshphYa-tFYXkINRxLN5I4HlUJaErVlWoWahA8oVWtpfDdvcnW3Boual5hRxdAwP84iBPOn79rO-O49HJ8-A0y3vIu02Dse6EYjKUtf-q5Iu8Yarm9v_C2f5Qs78SNbudBX8PpnWOO76O-w9xOnJJlH9nFgiluD3gVJQ-HEKDmyjpb-OTnvLEvchi7c2eJ9eo2j-M6hOM6jFxunoDN_aqbacNMEjotBAODlkKjpmau_BmLYFpm1C8Kf0Vkr54VIV9YP0LhDQafZ8mtar9B_p1Zmr-0tZ7J1rxk4WLlxkOrbCWV7PjPPK8690iA0OCMeAXswcx-lgUMFM2wvHSdQb5AXNgu_1eG_Q5jdpOHrxL6wGbCywP8yEsKO2Awd7oQvJ2SBWLzvkYrmlE2TChhvwBeZETUh7DCt5OqmVS5UnOJnwZp4t8ns1pNp80y12ec168sGKeFkkUz8uyzIpFUrA4z1_iOJqIJY1oEiU0iec0i-azhO1oEeVFVfC8zOMX8hLxlgk5k_KtnWlTT4S1HV-mcTrPJ5IVXFr_tpZSpLF-E7twvpmYJcpMi6625CWSwjp71uKEk_417yFLpykWKnz1IV0doVC-wK6KiySvWMQhUYMkYy5PCQwsa9IZubx69SRc0xWzUreEbtGj_s90b_S_eekI3fo4LKFbH-d_AgAA__8UW8v3">