<table border="1" cellspacing="0" cellpadding="8">
<tr>
<th>Issue</th>
<td>
<a href="https://github.com/llvm/llvm-project/issues/142046">142046</a>
</td>
</tr>
<tr>
<th>Summary</th>
<td>
64 bit shifts on 32 bit targets fail to utilize shift range information
</td>
</tr>
<tr>
<th>Labels</th>
<td>
new issue
</td>
</tr>
<tr>
<th>Assignees</th>
<td>
</td>
</tr>
<tr>
<th>Reporter</th>
<td>
tom-rein
</td>
</tr>
</table>
<pre>
When we know that the shift amount is at least 32, we can translate a 64-bit shift on a 32-bit target into a single 32-bit shift and set the lower register to 0 (a sketch of that lowering follows the examples below).
Clang fails to do this optimization in most cases. This happens at least on ARM, RISC-V and x86, and probably on other targets as well.
Here are some examples ([godbolt](https://godbolt.org/z/9Tdf1xKxd)):
```C++
#include <cstdint>
uint64_t shift_assume_ge32(uint64_t x, int s) {
    __builtin_assume(s >= 32);
    return x << s;
}
uint64_t shift_assume_bit5(uint64_t x, int s) {
    __builtin_assume((s | 32) == s);
    return x << s;
}
uint64_t shift_set_bit5(uint64_t x, int s) {
    s |= 32;
    return x << s;
}
uint64_t shift_set_bit5_assume_set(uint64_t x, int s) {
    __builtin_assume((s | 32) == s);
    s |= 32;
    return x << s;
}
uint64_t shift_32_plus(uint64_t x, int s) {
    __builtin_assume(s >= 0);
    s += 32;
    return x << s;
}
uint64_t shift_lt32(uint64_t x, int s) {
    __builtin_assume(s < 32);
    return x << s;
}
uint64_t shift_mask31(uint64_t x, int s) {
    s &= 31;
    return x << s;
}
uint64_t shift_assume_masked(uint64_t x, int s) {
    __builtin_assume((s & 31) == s);
    return x << s;
}
uint64_t shift_mask31_assume_masked(uint64_t x, int s) {
    __builtin_assume((s & 31) == s);
    s &= 31;
    return x << s;
}
```
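To make the expected transformation concrete, here is a rough sketch of what the `s >= 32` case (e.g. `shift_assume_ge32()`) could be lowered to, assuming a little-endian split of the `uint64_t` into 32-bit halves (the helper name is made up for illustration):
```C++
#include <cstdint>

// Sketch: a 64-bit left shift by s >= 32 needs only one 32-bit shift,
// and the low word becomes 0.
static uint64_t shl64_ge32(uint32_t lo, uint32_t hi, unsigned s) {
    (void)hi;                          // the old high word is shifted out entirely
    uint32_t out_hi = lo << (s - 32);  // a single 32-bit shift
    uint32_t out_lo = 0;               // the lower register is simply zeroed
    return ((uint64_t)out_hi << 32) | out_lo;
}
```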
Assuming that the shift amount is at least 32 or that the 5th bit is set has no effect (`shift_assume_ge32()` and `shift_assume_bit5()`).
Similarly, assuming it is less than 32 or that the bit is unset also has no effect (`shift_lt32()` and `shift_assume_masked()`).
Setting the 5th bit or masking with 31 (`shift_set_bit5()` and `shift_mask31()`) results in correctly simplified output, unless we additionally assume the value is already in range (`shift_set_bit5_assume_set()` and `shift_mask31_assume_masked()`): then the bit masking is correctly removed, but Clang forgets the range info and emits a general shift again.
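For reference, the correctly simplified output for the `s < 32` cases should amount to roughly the following (a sketch only, assuming a little-endian split of the `uint64_t` into 32-bit halves; the helper name is made up, and the `(lo >> 1) >> (31 - s)` form avoids an undefined shift by 32 when `s == 0`):
```C++
#include <cstdint>

// Sketch: what shift_mask31()/shift_assume_masked() should boil down to
// on a 32-bit target once s < 32 is known.
static uint64_t shl64_lt32(uint32_t lo, uint32_t hi, unsigned s) {
    uint32_t out_lo = lo << s;                              // low word: one 32-bit shift
    uint32_t out_hi = (hi << s) | ((lo >> 1) >> (31 - s));  // funnel the spilled bits into the high word
    return ((uint64_t)out_hi << 32) | out_lo;
}
```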
Sometimes we have a conditional shift by 32, which on a 32-bit target is just a conditional move, but Clang replaces it with a general 64-bit shift, which sometimes gets only partially simplified. In particular, on RISC-V it fails to realize that the left shifts are no-ops.
```C++
uint64_t conditional_shift32(uint64_t x, unsigned b) {
    __builtin_assume(b <= 1);
    return b ? x << 32 : x;
}
uint64_t shift_shifted_condition(uint64_t x, unsigned b) {
    __builtin_assume(b <= 1);
    return x << (32 * b);
}
uint64_t conditional_shift_bit5(uint64_t x, unsigned b) {
    return (b & 32) ? x << 32 : x;
}
uint64_t conditional_shift_bit5_assume(uint64_t x, unsigned b) {
    __builtin_assume((b & 32) == b);
    return (b & 32) ? x << 32 : x;
}
uint64_t conditional_shift_assume_bit5(uint64_t x, unsigned b) {
    __builtin_assume((b & 32) == b);
    return b ? x << 32 : x;
}
```
For `conditional_shift_bit5()` and `conditional_shift_bit5_assume()`, Clang sees that it is just a shift by `b`, but for `conditional_shift_assume_bit5()` it fails, even though the only difference is the missing `& 32`, which gets removed anyway.
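To spell out the conditional-move claim from above, `conditional_shift32()` on a 32-bit target amounts to no more than this (a sketch, again assuming a little-endian 32-bit split and a made-up helper name):
```C++
#include <cstdint>

// Sketch: a conditional shift by 32 is just a pair of conditional moves.
static uint64_t cond_shift32(uint32_t lo, uint32_t hi, unsigned b) {
    uint32_t out_hi = b ? lo : hi;  // conditional move
    uint32_t out_lo = b ? 0 : lo;   // conditional move (or mask)
    return ((uint64_t)out_hi << 32) | out_lo;
}
```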
The above also made me realize that the generic 64-bit shift on 32-bit RISC-V is super weird:
```
shift(unsigned long long, int):
addi a4, a2, -32
sll a3, a0, a2
bltz a4, .LBB0_2
mv a1, a3
srai a0, a4, 31
and a0, a0, a3
ret
.LBB0_2:
sll a1, a1, a2
not a2, a2
srli a0, a0, 1
srl a0, a0, a2
or a1, a1, a0
srai a0, a4, 31
and a0, a0, a3
ret
```
The `srai + and` before each `ret` is totally useless: on the first path the `srai` always produces 0 (so the final a0 is always 0), and on the other it always produces all-ones (so the final a0 is just a3).
With the Zicond extension it is slightly better:
```
shift(unsigned long long, int):
sll a1, a1, a2
not a3, a2
srli a4, a0, 1
sll a0, a0, a2
addi a2, a2, -32
srl a3, a4, a3
slti a2, a2, 0
or a1, a1, a3
czero.nez a3, a0, a2
czero.eqz a1, a1, a2
or a1, a1, a3
czero.eqz a0, a0, a2
ret
```
The `addi` is still there, even though it could have been combined with the `slti`. Also, a minor nitpick: if the `srli` were placed after the `srl`, we could avoid using a4 and use a compressible `srli`.
This is how I would write it:
```
shift(unsigned long long, int):
sll a1, a1, a2
not a3, a2
srl a3, a0, a3
sll a0, a0, a2
srli a3, a3, 1
slti a2, a2, 32
or a1, a1, a3
czero.nez a3, a0, a2
czero.eqz a1, a1, a2
or a1, a1, a3
czero.eqz a0, a0, a2
```
(The `slti` could also be replaced with `srli a2, a2, 5`, which has a compressed version, whereas if I'm not mistaken `slti` does not)
This can be rewritten to use only base instructions at the cost of just two more instructions:
```
shift(unsigned long long, int):
sll a1, a1, a2
not a3, a2
srl a3, a0, a3
sll a0, a0, a2
srli a3, a3, 1
addi a2, a2, -32
srai a2, a2, 31
or a1, a1, a3
and a1, a1, a2
not a3, a2
and a3, a3, a0
or a1, a1, a3
and a0, a0, a2
```
(The `addi` could be replaced with `slli a2, a2, 26`, but I'm not sure which is better)
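For reference, the branchless sequence above computes roughly the following (a C sketch of the same idea, assuming a little-endian 32-bit split; the corresponding assembly is noted in the comments):
```C++
#include <cstdint>

// Sketch: branchless 64-bit shift-left built only from 32-bit base operations.
static uint64_t shl64_branchless(uint32_t lo, uint32_t hi, unsigned s) {
    uint32_t lo_shifted = lo << (s & 31);            // sll a0, a0, a2
    uint32_t hi_funnel  = (hi << (s & 31)) |         // sll a1, a1, a2
                          ((lo >> (~s & 31)) >> 1);  // srl + srli
    uint32_t mask = (s < 32) ? 0xffffffffu : 0;      // addi + srai (all-ones iff s < 32)
    uint32_t out_hi = (hi_funnel & mask) | (lo_shifted & ~mask);
    uint32_t out_lo = lo_shifted & mask;
    return ((uint64_t)out_hi << 32) | out_lo;
}
```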
To me this is preferable to a branchy version from a side-channel perspective alone, since it is highly unintuitive that a shift introduces a branch.
</pre>
<img width="1" height="1" alt="" src="http://email.email.llvm.org/o/eJzcWU1v4zgS_TXKpdCGTNmOffDBSTbYxu5eprM7wF4CSipbnKZIL4uKk_z6RZGSLMcfnU5nLhMYCSKSVY9Vj6-KsiRSG4O4TKY3yfTuSja-sm7pbf3FoTJXuS1flr9XaGCH8N3YHfhKepBAlVp7UAQbh9Kj4-cGMpGIW55aSAPeSUNaegQJswnkynerjLcgIRODZ9KUQOjBVwja7tCBw42iYNlCOkrS1a2WZgNrqTTxs9KCrxSB3XpVq1fplTWgDNSWPBSSkEbwwBMqud2iIZAeNEryYA2sfvsXI_3t67fbL_8Jzp_nM35ifYWOYOtsLnP9AlKTZed_R4cgHQLZGgGfZb3VSJCIeTK92dgyt9on07tEzCvvt5Rkq0TcJ-K-HRpZt0nE_Wsi7hcP5Xr8_I_nMhEL_mSrJF0lszR-bhNxw590lYhMmUI3JUKS3RbkS2V8kv2Nh9JVo4yfTR7b8D1KoqbGxw1yBub94DNvSRkPlIgFJNdsFx4f80Zpr0y7KhFzAjac3YUELpIszAMAh75xBp4ZQZLdAsWh5PruEopc-ekPUQCcAhKxXN9GIJBkd4yKfgUUoX8fouC4DcKvOutCQej_xEh8BuJMPG51Qx9E2RMnfQtM3PwiMO0_zuXbXyRyLel7Nn5PTHins7DT8a8eG3aK5Ue2HHctZgzik05NjMBPQrvE5MvoPhTGXjTjvyt2qMwmliguJG1pqW1jTtUqsC5Mm_oqVCJFoQRVksBYwPUaCx8UfpaeUlnewCwNxePtDJYAfpiIBRePb6pWWjouJx3G6E4j0Vs0LZLGMBYuPxcAtWfkApI-c3FOBwi9j5Ha79464Mn8OH-BbHzgaCCip5z156X3Ag6p0VzrobDOYeH1C5Cqt1qtFZZgG79tPGP5twlR2GEMDgZQT1I3yGFQBpw0GzyJ5lBlzwO7HIwHbm-6yHchUDTA7bC2T7zwFvLGQxEbEes26CmsjBCVWdvgHmvlCSRs0KCTuuPhRiozilz9Zmv0qsaw70o-cY9UWFMq7mL6FZyH2FFVqqi4b-m7Ji_ZO8P8oyH_ZjWjPQTrcKtlgcS82ylfDcANW7O9L-oBhk1ao19gK51XUh8kcgRfTRwoGi0dY2ybKuX3vZpDqdUr7k-mxnXrkkJXZeyWRqf7oF5yBjt8DEtP1IfGhH62hPyyYuZRU-5gfEokefR-rzuZgCRbwfMPSj__xvKxh_k-cKc18zK-Hlki5gxOrILJc_iOAne6HzoHr3UaUbGMtw3JT0ToNIL9dj-WxCNEobDkpyL2ueAv9bl_Avj30HFQCu-tYwE8m_VDnfxRatrZUUYIkeIhVkPt6dUqmaV5ENaoPcqc9nAYv9ZDpxe8GJ-CJNtmU7WeggDViqgrW8FNiF70F2UriFUr1yDNy06-0AhiQh8qBJlb1lquqrUsEWrstIlt9PI0m7R7WlvXKW4nawTUbNHBDpUr39zeknTV6ui854G2ZhN-tR1Tf-eD9keWpQp_JzxDBsH_konBDNI6zszCjLSdt5-Qa_-6NzH6581N-jgcr59AjsOqbGjWyeg4Ggxrs_EQmSlhMCHdW3Dok3TVOTrYTg82OhwfgTXWxwliP0ROq7eexgdQ9TGSoVHr4NhrGk1_eJftT9zsMMkPkYDBdCJu2ARTOMe1ddjRk5fN0ljfvPWhcDaE3O2MgC2slaNAbqmZpxDPJK8ObyEGIzK8APmdCzcP_1fxmQJ89mgovPaIzatWm4r7lRy9R_cZ3DyfzD6L2VEu-mROziRTn0tmfxTE2aPgDo7C5JjU2h-ZSH9ElKGB4hWdHRl87ehxdOjiDPxfP-Mc09_piy2dJfZZ8nGsgm4SkFdaMzMcnlDPwja6jE1mjsj9eJ0rzv6u4xPTVHu2NoIVa6OEWhnrwCi_VcX3wI51P9Xp4HgnCUJXWYJcx2tVN95pMrbO5ZNVJTRBvOUk0Lyh2PPWW4dEKtdD27Et5_afoLI7-Aq7YGfnlEdQ_rOofVGjTjP7kA8hmcHSufyFw9CuyY5PglcHTI1ct-499DxbDQasOtrfj02fZOPBVVvMHw5I0yWZmZNjd9do-dUm9WCX04OazdfbPROwhCd0FPpnnoEOJTH7vibiug6JqRV5-R0NDDGUFvmWHPLckqeQJuJh2ng-EzbQLvQSuSS-sZF3TcHNCUW4YczvLNSs5MPxvwrngsaeENhQzA64OD6UsrO84eP8s7sKawYIZfozzt5Hz04hIz1PMlO_YaaYDdrXPeOocdiyVVFXXvdE85YbSUWwdbhGJ1nM7BM6kJA7aYrqpeM0rJ2tuWNWJUJRSWNQwxYdbbHwKrSm1gQRJ2UKbAt7pTYVNw9GGd-oMO_waxjjnS0bvt93HllBr8plVi6yhbzC5fh6Mp9Or8VkclUt83VepOv1BNflWMwXxSQvU3k9zcR6mqLIF1dqKVIxTadiIYRIJ5PRdFyWi_lkUkjMcZHOk0mKtVR6pPVTPbJuc6WIGlyOJyKdzK60zFFT-FJJCIM7CKOJEMn07sotedGXvNlQMkm1Ik97M155jcvhOwnu_Q_felC4KITT7FV4rxDDsH8L4-rwddBV4_TyzfcxyldNPipsnYh79tr--bJ19g8sfCLuA1ZKxH27mael-H8AAAD__y2B-08">