<table border="1" cellspacing="0" cellpadding="8">

    <tr>

        <th>Issue</th>

        <td>

            <a href=https://github.com/llvm/llvm-project/issues/66317>66317</a>

        </td>

    </tr>

    <tr>

        <th>Summary</th>

        <td>

            aarch64 backend fails to optimize vector equal as bitmask operation

        </td>

    </tr>

    <tr>

      <th>Labels</th>

      <td>

            new issue

      </td>

    </tr>

    <tr>

      <th>Assignees</th>

      <td>

      </td>

    </tr>

    <tr>

      <th>Reporter</th>

      <td>

          Validark

      </td>

    </tr>

</table>

<pre>

    In [this file](https://zig.godbolt.org/z/qzEeEo3KM) I have the following function starting at line 351:

```zig

fn nextChunk(source: [*]align(VEC_SIZE) const u8, prev_escaped: VEC_INT) Bitmaps {

    const input_vec: VEC = source[0..VEC_SIZE].*;

    // zig fmt: off

    const quotes     : VEC_INT = @bitCast(input_vec == @as(VEC, @splat(@as(u8, '"'))));

    const backslashes: VEC_INT = @bitCast(input_vec == @as(VEC, @splat(@as(u8, '\\'))));

    const tabs       : VEC_INT = @bitCast(input_vec == @as(VEC, @splat(@as(u8, '\t'))));

    const newlines   : VEC_INT = @bitCast(input_vec == @as(VEC, @splat(@as(u8, '\n'))));

    const carriages  : VEC_INT = @bitCast(input_vec == @as(VEC, @splat(@as(u8, '\r'))));

    const spaces     : VEC_INT = @bitCast(input_vec == @as(VEC, @splat(@as(u8, ' '))));

    const underscores: VEC_INT = @bitCast(input_vec == @as(VEC, @splat(@as(u8, '_'))));

    const upper_alpha: VEC_INT = @as(VEC_INT, @bitCast(@as(VEC, @splat(@as(u8, 'A'))) <= input_vec)) & @as(VEC_INT, @bitCast(input_vec <= @as(VEC, @splat(@as(u8, 'Z')))));

    const lower_alpha: VEC_INT = @as(VEC_INT, @bitCast(@as(VEC, @splat(@as(u8, 'a'))) <= input_vec)) & @as(VEC_INT, @bitCast(input_vec <= @as(VEC, @splat(@as(u8, 'z')))));

    const digits     : VEC_INT = @as(VEC_INT, @bitCast(@as(VEC, @splat(@as(u8, '0'))) <= input_vec)) & @as(VEC_INT, @bitCast(input_vec <= @as(VEC, @splat(@as(u8, '9')))));

    // zig fmt: on

    // ----------------------------------------------------------------------------

    // This code is brought to you courtesy of simdjson and simdjzon, both licensed

    // under the Apache 2.0 license which is included at the bottom of this file

    // If there was overflow, pretend the first character isn't a backslash

    const backslash: VEC_INT = backslashes & ~prev_escaped;

    const follows_escape = (backslash << 1) | prev_escaped;

    // Get sequences starting on even bits by clearing out the odd series using +

    const even_bits: VEC_INT = @bitCast(@as(@Vector(@divExact(VEC_SIZE, 8), u8), @splat(@as(u8, 0x55))));

    const odd_sequence_starts = backslash & ~even_bits & ~follows_escape;

    const x = @addWithOverflow(odd_sequence_starts, backslash);

    const invert_mask: VEC_INT = x[0] << 1; // The mask we want to return is the *escaped* bits, not escapes.

    // Mask every other backslashed character as an escaped character

    // Flip the mask for sequences that start on even bits, to correct them

    const escaped = (even_bits ^ invert_mask) & follows_escape;

    // ----------------------------------------------------------------------------

    const whitespace = tabs | newlines | carriages | spaces;

    const non_linebreaks = ~(newlines | carriages);

    const identifiers_or_numbers = underscores | upper_alpha | lower_alpha | digits;

    const non_unescaped_quotes = ~(quotes & ~escaped);

    return .{

        .whitespace = whitespace,

        .non_linebreaks = non_linebreaks,

        .identifiers_or_numbers = identifiers_or_numbers,

        .non_unescaped_quotes = non_unescaped_quotes,

        .prev_escaped = x[1],

    };

}

```
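
For reference, here is a minimal scalar sketch of the result each `@bitCast(input_vec == ...)` line above should produce. It assumes `VEC_SIZE` is 64 and `VEC_INT` is `u64` (consistent with the 64-byte array used later in this report) and a little-endian target, so that byte `i` maps to bit `i`. `maskForCharScalar` and `maskForCharVector` are illustrative names, not functions from the linked file.

```zig
const std = @import("std");

// Assumed values, chosen to match the 64-byte chunks used in this report.
const VEC_SIZE = 64;
const VEC = @Vector(VEC_SIZE, u8);
const VEC_INT = u64;

// Scalar reference: bit i of the result is set iff bytes[i] == char.
fn maskForCharScalar(bytes: [VEC_SIZE]u8, char: u8) VEC_INT {
    var mask: VEC_INT = 0;
    for (bytes, 0..) |byte, i| {
        const shift: u6 = @intCast(i);
        if (byte == char) mask |= @as(VEC_INT, 1) << shift;
    }
    return mask;
}

// Vector form, as written in `nextChunk`: compare every lane against the
// splatted character, then bitcast the 64 x bool result to a 64-bit integer.
fn maskForCharVector(bytes: [VEC_SIZE]u8, char: u8) VEC_INT {
    const input_vec: VEC = bytes;
    return @bitCast(input_vec == @as(VEC, @splat(char)));
}
```

Both functions should return the same mask for any input. The issue is only about how the vector form is lowered on aarch64, not about its result.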

The [aarch64 emit for this function is about 10 times longer than the x86_64 (zen3) emit](https://zig.godbolt.org/z/qzEeEo3KM), seemingly because most of it is not being vectorized at all. I tried deleting different parts of the function to isolate whatever was blocking vectorization, but no matter what I removed, the remaining code was still not vectorized. I also tried a few alternative implementations of the step that compares each byte for equality and packs the results into a bitmask.

```zig

fn maskForChar(input_vec: VEC, comptime char: u8) VEC_INT {

    switch (builtin.cpu.arch) {

        .aarch64 => {

            const bitmask = comptime std.simd.repeat(64, @as(@Vector(8, u8), @splat(1)) << std.simd.iota(u8, 8));

            const chr_matchers = input_vec == @as(VEC, @splat(@as(u8, char)));

            const fee = @select(u8, chr_matchers, bitmask, @as(VEC, @splat(0)));

            const a = std.simd.deinterlace(2, fee);

            const c = std.simd.deinterlace(2, a[0] + a[1]);

            const e = std.simd.deinterlace(2, c[0] + c[1]);

            return @bitCast(e[0] + e[1]);

        },

        else => return @bitCast(input_vec == @as(VEC, @splat(char))),

    }

}

```

I was hoping that, with a little help, the compiler would find the algorithm I wanted, but no luck. So I tried again, this time spelling out [this algorithm](https://branchfree.org/2019/04/01/fitting-my-head-through-the-arm-holes-or-two-sequences-to-substitute-for-the-missing-pmovmskb-instruction-on-arm-neon/#gist95474023) explicitly.

```zig

fn vpadd(a: @Vector(16, u8), b: @Vector(16, u8)) @Vector(16, u8) {

    const vecs = std.simd.deinterlace(2, std.simd.join(a, b));

    return vecs[0] + vecs[1];

}

fn maskForChar(input_vec: VEC, char: u8) VEC_INT {

    const char_mask = @as(@Vector(16, u8), @splat(@as(u8, char)));

    const bitmask = comptime std.simd.repeat(16, @as(@Vector(8, u8), @splat(1)) << std.simd.iota(u8, 8));

    const zeroes: @Vector(16, u8) = @splat(0);

    const bytes: [64]u8 = input_vec;

    const a = @select(u8, @as(@Vector(16, u8), bytes[0..16].*) == char_mask, bitmask, zeroes);

    const b = @select(u8, @as(@Vector(16, u8), bytes[16..32].*) == char_mask, bitmask, zeroes);

    const c = @select(u8, @as(@Vector(16, u8), bytes[32..48].*) == char_mask, bitmask, zeroes);

    const d = @select(u8, @as(@Vector(16, u8), bytes[48..64].*) == char_mask, bitmask, zeroes);

    // Add each of the elements next to each other, successively, to stuff each 8 byte mask into one.

    var sum0 = vpadd(a, b);

    const sum1 = vpadd(c, d);

    sum0 = vpadd(sum0, sum1);

    sum0 = vpadd(sum0, sum0);

    return @as(@Vector(2, u64), @bitCast(sum0))[0];

}

```
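
To make the intended lane order explicit, here is a scalar sketch of the pairwise-add reduction above, using plain arrays instead of vectors. It assumes a pairwise add places the adjacent-pair sums of its first operand in the low 8 lanes and those of its second operand in the high 8 lanes, which is what the `vpadd` helper above computes. `pairwiseAdd`, `selectBits`, and `maskForCharModel` are hypothetical names used only for this illustration.

```zig
const std = @import("std");

// Scalar model of a pairwise add: lanes 0..7 hold the sums of adjacent pairs
// of `a`, lanes 8..15 hold the sums of adjacent pairs of `b`.
fn pairwiseAdd(a: [16]u8, b: [16]u8) [16]u8 {
    var out: [16]u8 = undefined;
    for (0..8) |i| {
        // The summed bytes never share a set bit here, so `+` cannot overflow.
        out[i] = a[2 * i] + a[2 * i + 1];
        out[i + 8] = b[2 * i] + b[2 * i + 1];
    }
    return out;
}

// Lane i keeps the single bit (1 << (i % 8)) if it matches `char`, else 0.
fn selectBits(bytes: [16]u8, char: u8) [16]u8 {
    var out: [16]u8 = undefined;
    for (bytes, 0..) |byte, i| {
        const bit: u3 = @intCast(i % 8);
        out[i] = if (byte == char) @as(u8, 1) << bit else 0;
    }
    return out;
}

fn maskForCharModel(bytes: [64]u8, char: u8) u64 {
    const a = selectBits(bytes[0..16].*, char);
    const b = selectBits(bytes[16..32].*, char);
    const c = selectBits(bytes[32..48].*, char);
    const d = selectBits(bytes[48..64].*, char);

    // Three rounds of pairwise adds fold every group of 8 lanes into one byte.
    var sum0 = pairwiseAdd(a, b);
    const sum1 = pairwiseAdd(c, d);
    sum0 = pairwiseAdd(sum0, sum1);
    sum0 = pairwiseAdd(sum0, sum0);

    // The low 8 lanes now hold the mask bytes in order; assemble them with
    // lane 0 as the least significant byte.
    var mask: u64 = 0;
    for (sum0[0..8], 0..) |byte, i| {
        const shift: u6 = @intCast(8 * i);
        mask |= @as(u64, byte) << shift;
    }
    return mask;
}
```

After the third round, each of the low 8 lanes is the sum of 8 consecutive single-bit lanes, i.e. one complete byte of the 64-bit mask, which is why reading the low 64 bits of `sum0` gives the final result.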

I could not get my `vpadd` implementations to emit the target instruction. I also tried these two:

```zig

fn vpadd2(a: @Vector(16, u8), b: @Vector(16, u8)) @Vector(16, u8) {

    return [16]u8{

        a[ 0] + a[ 1], a[ 2] + a[ 3], a[ 4] + a[ 5], a[ 6] + a[ 7],

        a[ 8] + a[ 9], a[10] + a[11], a[12] + a[13], a[14] + a[15],

        b[ 0] + b[ 1], b[ 2] + b[ 3], b[ 4] + b[ 5], b[ 6] + b[ 7],

        b[ 8] + b[ 9], b[10] + b[11], b[12] + b[13], b[14] + b[15],

    };

}

fn vpadd3(a: @Vector(16, u8), b: @Vector(16, u8)) @Vector(16, u8) {

    const shuf = comptime std.simd.join(std.simd.iota(i5, 8), ~std.simd.iota(i5, 8));

    const a_vecs = std.simd.deinterlace(2, a);

    const b_vecs = std.simd.deinterlace(2, b);

    return @shuffle(u8, a_vecs[0] + a_vecs[1], b_vecs[0] + b_vecs[1], shuf);

}

```

I would be very grateful if anyone could offer insight into why this is not being optimized properly/well. Thank you.

</pre>
