<table border="1" cellspacing="0" cellpadding="8">
<tr>
<th>Issue</th>
<td>
<a href=https://github.com/llvm/llvm-project/issues/66317>66317</a>
</td>
</tr>
<tr>
<th>Summary</th>
<td>
aarch64 backend fails to optimize vector equal as bitmask operation
</td>
</tr>
<tr>
<th>Labels</th>
<td>
new issue
</td>
</tr>
<tr>
<th>Assignees</th>
<td>
</td>
</tr>
<tr>
<th>Reporter</th>
<td>
Validark
</td>
</tr>
</table>
<pre>
In [this file](https://zig.godbolt.org/z/qzEeEo3KM) I have the following function starting at line 351:
```zig
fn nextChunk(source: [*]align(VEC_SIZE) const u8, prev_escaped: VEC_INT) Bitmaps {
    const input_vec: VEC = source[0..VEC_SIZE].*;
    // zig fmt: off
    const quotes     : VEC_INT = @bitCast(input_vec == @as(VEC, @splat(@as(u8, '"'))));
    const backslashes: VEC_INT = @bitCast(input_vec == @as(VEC, @splat(@as(u8, '\\'))));
    const tabs       : VEC_INT = @bitCast(input_vec == @as(VEC, @splat(@as(u8, '\t'))));
    const newlines   : VEC_INT = @bitCast(input_vec == @as(VEC, @splat(@as(u8, '\n'))));
    const carriages  : VEC_INT = @bitCast(input_vec == @as(VEC, @splat(@as(u8, '\r'))));
    const spaces     : VEC_INT = @bitCast(input_vec == @as(VEC, @splat(@as(u8, ' '))));
    const underscores: VEC_INT = @bitCast(input_vec == @as(VEC, @splat(@as(u8, '_'))));
    const upper_alpha: VEC_INT = @as(VEC_INT, @bitCast(@as(VEC, @splat(@as(u8, 'A'))) <= input_vec)) & @as(VEC_INT, @bitCast(input_vec <= @as(VEC, @splat(@as(u8, 'Z')))));
    const lower_alpha: VEC_INT = @as(VEC_INT, @bitCast(@as(VEC, @splat(@as(u8, 'a'))) <= input_vec)) & @as(VEC_INT, @bitCast(input_vec <= @as(VEC, @splat(@as(u8, 'z')))));
    const digits     : VEC_INT = @as(VEC_INT, @bitCast(@as(VEC, @splat(@as(u8, '0'))) <= input_vec)) & @as(VEC_INT, @bitCast(input_vec <= @as(VEC, @splat(@as(u8, '9')))));
    // zig fmt: on
    // ----------------------------------------------------------------------------
    // This code is brought to you courtesy of simdjson and simdjzon, both licensed
    // under the Apache 2.0 license which is included at the bottom of this file
    // If there was overflow, pretend the first character isn't a backslash
    const backslash: VEC_INT = backslashes & ~prev_escaped;
    const follows_escape = (backslash << 1) | prev_escaped;
    // Get sequences starting on even bits by clearing out the odd series using +
    const even_bits: VEC_INT = @bitCast(@as(@Vector(@divExact(VEC_SIZE, 8), u8), @splat(@as(u8, 0x55))));
    const odd_sequence_starts = backslash & ~even_bits & ~follows_escape;
    const x = @addWithOverflow(odd_sequence_starts, backslash);
    const invert_mask: VEC_INT = x[0] << 1; // The mask we want to return is the *escaped* bits, not escapes.
    // Mask every other backslashed character as an escaped character
    // Flip the mask for sequences that start on even bits, to correct them
    const escaped = (even_bits ^ invert_mask) & follows_escape;
    // ----------------------------------------------------------------------------
    const whitespace = tabs | newlines | carriages | spaces;
    const non_linebreaks = ~(newlines | carriages);
    const identifiers_or_numbers = underscores | upper_alpha | lower_alpha | digits;
    const non_unescaped_quotes = ~(quotes & ~escaped);
    return .{
        .whitespace = whitespace,
        .non_linebreaks = non_linebreaks,
        .identifiers_or_numbers = identifiers_or_numbers,
        .non_unescaped_quotes = non_unescaped_quotes,
        .prev_escaped = x[1],
    };
}
```
The [aarch64 emit for this function is mostly unvectorized, and about 10 times longer than the x86_64 (zen3) emit](https://zig.godbolt.org/z/qzEeEo3KM). I tried deleting different parts of the function to isolate whatever was blocking vectorization, but no matter what I removed, the rest still was not vectorized. I also tried a few alternative implementations of the vector-equality-to-bitmask step.
```zig
fn maskForChar(input_vec: VEC, comptime char: u8) VEC_INT {
    switch (builtin.cpu.arch) {
        .aarch64 => {
            const bitmask = comptime std.simd.repeat(64, @as(@Vector(8, u8), @splat(1)) << std.simd.iota(u8, 8));
            const chr_matchers = input_vec == @as(VEC, @splat(@as(u8, char)));
            const fee = @select(u8, chr_matchers, bitmask, @as(VEC, @splat(0)));
            const a = std.simd.deinterlace(2, fee);
            const c = std.simd.deinterlace(2, a[0] + a[1]);
            const e = std.simd.deinterlace(2, c[0] + c[1]);
            return @bitCast(e[0] + e[1]);
        },
        else => return @bitCast(input_vec == @as(VEC, @splat(char))),
    }
}
```
I was hoping that, with a little help, the compiler would find the algorithm I want, but alas, no cigar. So I tried again, this time begging the compiler to [give me this algorithm](https://branchfree.org/2019/04/01/fitting-my-head-through-the-arm-holes-or-two-sequences-to-substitute-for-the-missing-pmovmskb-instruction-on-arm-neon/#gist95474023).
```zig
fn vpadd(a: @Vector(16, u8), b: @Vector(16, u8)) @Vector(16, u8) {
    const vecs = std.simd.deinterlace(2, std.simd.join(a, b));
    return vecs[0] + vecs[1];
}
fn maskForChar(input_vec: VEC, char: u8) VEC_INT {
    const char_mask = @as(@Vector(16, u8), @splat(@as(u8, char)));
    const bitmask = comptime std.simd.repeat(16, @as(@Vector(8, u8), @splat(1)) << std.simd.iota(u8, 8));
    const zeroes: @Vector(16, u8) = @splat(0);
    const bytes: [64]u8 = input_vec;
    const a = @select(u8, @as(@Vector(16, u8), bytes[0..16].*) == char_mask, bitmask, zeroes);
    const b = @select(u8, @as(@Vector(16, u8), bytes[16..32].*) == char_mask, bitmask, zeroes);
    const c = @select(u8, @as(@Vector(16, u8), bytes[32..48].*) == char_mask, bitmask, zeroes);
    const d = @select(u8, @as(@Vector(16, u8), bytes[48..64].*) == char_mask, bitmask, zeroes);
    // Add each of the elements next to each other, successively, to stuff each 8 byte mask into one.
    var sum0 = vpadd(a, b);
    const sum1 = vpadd(c, d);
    sum0 = vpadd(sum0, sum1);
    sum0 = vpadd(sum0, sum0);
    return @as(@Vector(2, u64), @bitCast(sum0))[0];
}
```
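For reference, the result that every variant above is meant to produce can be stated as a scalar loop. This is only a sketch of the intended semantics, assuming a 64-byte chunk, a `u64` mask, and a little-endian target; `scalarMaskForChar` is a hypothetical name that does not appear in the linked file.

```zig
const std = @import("std");

/// Hypothetical scalar reference (not part of the linked file).
/// Bit i of the result is set iff chunk[i] == char.
/// Assumes a 64-byte chunk and a u64 mask.
fn scalarMaskForChar(chunk: [64]u8, char: u8) u64 {
    var mask: u64 = 0;
    for (chunk, 0..) |byte, i| {
        // The shift amount is cast down from usize to the u6 that a u64 shift expects.
        if (byte == char) mask |= @as(u64, 1) << @intCast(i);
    }
    return mask;
}

test "scalar reference sets one bit per matching byte" {
    var chunk = [_]u8{'a'} ** 64;
    chunk[0] = '"';
    chunk[9] = '"';
    chunk[63] = '"';
    const expected: u64 = (1 << 0) | (1 << 9) | (1 << 63);
    try std.testing.expectEqual(expected, scalarMaskForChar(chunk, '"'));
}
```

Comparing a vectorized `maskForChar` against this loop on a few chunks is a cheap way to confirm that a rewrite still computes the same mask.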
I could not get any of my `vpadd` implementations to emit the target pairwise-add instruction (`addp`). I also tried these two:
```zig
fn vpadd2(a: @Vector(16, u8), b: @Vector(16, u8)) @Vector(16, u8) {
    return [16]u8{
        a[ 0] + a[ 1], a[ 2] + a[ 3], a[ 4] + a[ 5], a[ 6] + a[ 7],
        a[ 8] + a[ 9], a[10] + a[11], a[12] + a[13], a[14] + a[15],
        b[ 0] + b[ 1], b[ 2] + b[ 3], b[ 4] + b[ 5], b[ 6] + b[ 7],
        b[ 8] + b[ 9], b[10] + b[11], b[12] + b[13], b[14] + b[15],
    };
}
fn vpadd3(a: @Vector(16, u8), b: @Vector(16, u8)) @Vector(16, u8) {
    const shuf = comptime std.simd.join(std.simd.iota(i5, 8), ~std.simd.iota(i5, 8));
    const a_vecs = std.simd.deinterlace(2, a);
    const b_vecs = std.simd.deinterlace(2, b);
    return @shuffle(u8, a_vecs[0] + a_vecs[1], b_vecs[0] + b_vecs[1], shuf);
}
```
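All of the `vpadd` variants are meant to compute the same thing: a pairwise add of adjacent lanes, with the sums from `a` in the low half of the result and the sums from `b` in the high half. A scalar sketch of that operation follows. `vpaddScalar` is a hypothetical name, and the wrapping add (`+%`) is an assumption; the bit masks used here never sum past 255, so wrapping makes no difference for this use.

```zig
const std = @import("std");

/// Hypothetical scalar model of the pairwise add (not part of the linked file).
/// out[0..8] holds sums of adjacent lanes of `a`, out[8..16] those of `b`.
fn vpaddScalar(a: [16]u8, b: [16]u8) [16]u8 {
    var out: [16]u8 = undefined;
    for (0..8) |i| {
        out[i] = a[2 * i] +% a[2 * i + 1];
        out[i + 8] = b[2 * i] +% b[2 * i + 1];
    }
    return out;
}

test "pairwise add folds adjacent bit masks together" {
    const bits = [8]u8{ 1, 2, 4, 8, 16, 32, 64, 128 };
    const a: [16]u8 = bits ++ bits;
    const zero = [_]u8{0} ** 16;
    const out = vpaddScalar(a, zero);
    const expected = [16]u8{ 3, 12, 48, 192, 3, 12, 48, 192, 0, 0, 0, 0, 0, 0, 0, 0 };
    try std.testing.expectEqualSlices(u8, &expected, &out);
}
```

Applying this three times to a vector of per-lane bit masks collapses each group of 8 lanes into one byte, which is what the `maskForChar` above relies on.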
I would be very grateful if anyone could offer insight into why this is not being optimized well. Thank you.
</pre>
YggwLdWmoiYtD9nhe_tl6-dRHjPUmAN01EMDB0ueOglHLJZwOOf1XXMOIviITupu7jofZcCEAcVtll1EKDEHuOKoqyLogeVED_bB5bz3hMJVoWlfaHpFaFteBswV-gLgHQGYhE9zOXANXZSPCf9EsAzD47QSfKAMPNump5Q15h9Maqhgx0xJKOHMGA6kBF67nAX2tDGOKc2mxQQITThguslZ2yhRXkjFTFmRZ0c9mWWfmbTB12AOpi3XJCkrqPLJr_KQk2lh8_nK1SE82HYZFIUD2INgJBawAhNiBW7-QfdYkUkUFWmZK4C2xkRBuPSidTDDr9CL1jkzWA8m1X5SAs0mprS97MSUMKGqmpSSg55INTE7OTlw1ImRE90k2jDTGJjkOF7CpGIau6NJXcltpV-SCRPaqMbW1YkUVqAA7H_XXjQtmDbLeHY7CyKslNdz9bamWeZFC3sf1M9t4XyQ3JJ3x5eXhsYu7reQ6qtH6zD2p8SNXFAH4jyXtJGPQvuHqX225-mcyXyiUl0pUF2ipmpzqCVjpeLEnd-RqD9Vs5y6n1OzHLA3UNLd9l4Mhbau9GvDyY1csjetjPhhPvPix2YxrHsjiumFgvWBXXDq7I9F4bz7mcghtf7tdvWk3rWmjsD_cSjh3Pen0Q9hOamKPwRnGvn-bPH3wMl-HM5s4fs2Lr4bzvDG8T7LHNHtqK_jw9r-oomVwQ2aEpRNSk2agtZsC3zfXh1o0-S5m7awMN3tAxNGEik6soxO2FJFdFMF1gvHtNultZHfcpoqHExOcXJ2OvlMJr5waKvwM5ODd9visV2ymbpBfrw8u63uBEbLNi1_sKHsWiVsZwowpNoTbx44uPPgrGfBPcKeE3fPUIUregUS-x_KtWxZgSlBAzE7-ZEfta3K6GeVxs7LmAFs3jvrW5C0kgGFJW3f7h6iwdC0PzQbDMX9oflg6Pb0IuCgeDGYt-yJCIe0uo8pHGAK-5jCAaYwHlOcDCxO-hYnA4uTvsXJwOKkb3EysDi5aHEysDjpW5wMLE76FicDi5O-xcnA4mTU4vE7l0E0Tn8uUdNlk18gGi0rOyUJLO7fsP_13vBYxqObD3FDeoEafWz1WbY9Jjg0OOdwqEwOz6Bz3PSJJUo7m5KcTkGpg9R6rZPqeiJ7M14oaiBvOGE5oWIvBbQJUuY5KEx2rCiNKzi7cu8aGKZ710ES985eJ9VK1qCwdq13wLlPfiupeCF72fg32d00W06X9AbuwvlyNo-XQRjdlHd5FgTBchqES6DpAuaLIJwtg9k8nUE-z6L4ht1FQTQNliH2HPM48m-neQiLLJjBjAZwG3uzACrKuM_5tsKe6YZp3cDdfD4Nb284TYBr-0-fokjAjthBL8KDdKPucM0kaQrtzQLOtNFHKYYZDnfddU9C0xcQGckp47YsdGa312HttRbVBwqNvrBV5KZR_G7Y5BXMlE3ip7LyojVqbP8zqZX805KXtcWpvWht7fh_AAAA__--Rbg4">