<table border="1" cellspacing="0" cellpadding="8">
<tr>
<th>Issue</th>
<td>
<a href=https://github.com/llvm/llvm-project/issues/96395>96395</a>
</td>
</tr>
<tr>
<th>Summary</th>
<td>
LLVM should recognize more patterns that map to pmovmsk or similar instructions
</td>
</tr>
<tr>
<th>Labels</th>
<td>
new issue
</td>
</tr>
<tr>
<th>Assignees</th>
<td>
</td>
</tr>
<tr>
<th>Reporter</th>
<td>
jhorstmann
</td>
</tr>
</table>
<pre>
This comes up regularly in functions that process or detect ASCII input. My current example is in Rust, in a `const` function, which prevents me from using explicit SIMD intrinsics, but the code generated by clang is nearly the same.
```rust
pub unsafe fn ascii_prefix(input: &[u8]) -> usize {
    let mut mask = 0_u16;
    for i in 0..16 {
        mask |= ((*input.get_unchecked(i) < 128) as u16) << i;
    }
    mask.trailing_ones() as usize
}
```
Ideally this would map to 4 instructions: `compare`, `movmsk`, `not` and `tzcnt`. But currently it gets compiled to about 50 instructions with a mix of scalar and vector instructions: https://godbolt.org/z/jaKb5TMrn
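For reference, here is a rough sketch of the sequence I would hope for, written with explicit intrinsics (so it is not usable in a `const` function). The names `ascii_prefix_scalar` and `ascii_prefix_simd` are made up for illustration, and the functions take a `&[u8; 16]` instead of a slice so that no unchecked indexing is needed. Since `movmsk` collects the top bit of every byte, and that bit is set exactly for bytes >= 128, the sketch skips the separate compare.

```rust
/// Scalar reference: same logic as `ascii_prefix` above, but on a
/// fixed-size array so that indexing is bounds-checked at compile time.
pub fn ascii_prefix_scalar(input: &[u8; 16]) -> usize {
    let mut mask = 0_u16;
    for i in 0..16 {
        mask |= ((input[i] < 128) as u16) << i;
    }
    mask.trailing_ones() as usize
}

/// Hypothetical hand-written version using SSE2 intrinsics.
#[cfg(target_arch = "x86_64")]
pub fn ascii_prefix_simd(input: &[u8; 16]) -> usize {
    use std::arch::x86_64::{__m128i, _mm_loadu_si128, _mm_movemask_epi8};
    // SAFETY: SSE2 is part of the x86_64 baseline, and `input` points to
    // 16 readable bytes; the load is unaligned.
    let non_ascii = unsafe {
        let chunk = _mm_loadu_si128(input.as_ptr() as *const __m128i);
        // Bit i of the result is the top bit of byte i (set for bytes >= 128).
        _mm_movemask_epi8(chunk) as u16
    };
    // Invert to get the "is ASCII" mask, then count the leading run of ones.
    (!non_ascii).trailing_ones() as usize
}

/// Fallback for other targets so the sketch still compiles there.
#[cfg(not(target_arch = "x86_64"))]
pub fn ascii_prefix_simd(input: &[u8; 16]) -> usize {
    ascii_prefix_scalar(input)
}
```

Both functions should return the same value for every 16-byte input; the only intended difference is the generated code.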
A slightly simpler example that only tries to detect fully ASCII chunks leads to similarly complex output.
```rust
pub unsafe fn is_ascii_mask(input: &[u8]) -> bool {
    let mut is_ascii = [false; 16];
    for i in 0..16 {
        is_ascii[i] = *input.get_unchecked(i) <= 127;
    }
    let mut mask = 0_u16;
    for i in 0..16 {
        mask |= (is_ascii[i] as u16) << i;
    }
    mask == 0xFFFF
}
```
However, for detection only, I found one pattern that generates the desired instruction sequence, so LLVM evidently has some logic to recognize it:
```rust
pub unsafe fn is_ascii_sum(input: &[u8]) -> bool {
    let mut is_ascii = [false; 16];
    for i in 0..16 {
        is_ascii[i] = *input.get_unchecked(i) <= 127;
    }
    let mut count = 0_u8;
    for i in 0..16 {
        count += is_ascii[i] as u8;
    }
    count == 16
}
```
```asm
example::is_ascii_sum::h00bc3be7e76426eb:
vmovdqu xmm0, xmmword ptr [rdi]
vpmovmskb eax, xmm0
test eax, eax
sete al
ret
```
Unfortunately, this `sum`-based pattern cannot be used for the initial example, which needs a mask as output.
#45398 seems related, but focuses a bit more on the bit counting aspect, where `movmsk` might only be beneficial if the `popcnt` instruction is also available.
</pre>