<table border="1" cellspacing="0" cellpadding="8">
    <tr>
        <th>Issue</th>
        <td>
            <a href=https://github.com/llvm/llvm-project/issues/59829>59829</a>
        </td>
    </tr>

    <tr>
        <th>Summary</th>
        <td>
            Inefficient code generated for `__builtin_convertvector` to a boolean vector on ARM.
        </td>
    </tr>

    <tr>
      <th>Labels</th>
      <td>
            new issue
      </td>
    </tr>

    <tr>
      <th>Assignees</th>
      <td>
      </td>
    </tr>

    <tr>
      <th>Reporter</th>
      <td>
          lawben
      </td>
    </tr>
</table>

<pre>
While experimenting with converting the results of vector comparisons to masks, I noticed that the code generated for ARM is considerably less efficient than it could be.

Link to godbolt for code: [https://godbolt.org/z/3jdd9b76d](https://godbolt.org/z/3jdd9b76d)

Consider the following code, which compares two vectors and creates a mask from the result, i.e., a boolean vector in which each value is represented by one bit instead of 16 bits in this example.

```cpp
using VecU16x16 __attribute__((vector_size(16), aligned(16))) = uint16_t;
using MaskT __attribute__((ext_vector_type(8))) = bool;

uint8_t mask_from_builtin_all_ones_or_zeros(VecU16x16 x, VecU16x16 y) {
    auto matches = x == y;
    auto result = __builtin_convertvector(matches, MaskT);
    return reinterpret_cast<uint8_t&>(result);
}
```

This generates quite a lot of instructions and has worse performance than a more commonly used approach on ARM with NEON, which uses a mask and "horizontally" adds all values in a vector.

```cpp
uint8_t mask_manual_all_ones_or_zeros(VecU16x16 a, VecU16x16 b) {
    auto matches = vceqq_u16(a, b);
    constexpr VecU16x16 mask = {1, 2, 4, 8, 16, 32, 64, 128};
    return vaddvq_u16(vandq_u16(matches, mask));
}
```

This can also be applied to `uint8/32/64_t` with similar masks/instructions.
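
To make the generalization concrete, here is a portable scalar sketch of the same AND + horizontal-add idea for different lane widths. It is only a model for illustration, not NEON code: `weighted_sum_mask` is a hypothetical helper name, and it assumes every lane is either all-ones or all-zeros and that the vector has no more lanes than a lane has bits (so 16 x `uint8_t` would need to be handled as two 8-lane halves).

```cpp
#include <array>
#include <cstddef>
#include <cstdint>

// Scalar model of the AND + horizontal-add trick (hypothetical helper).
// Assumption: each lane is all-ones (match) or all-zeros (no match).
template <typename LaneT, std::size_t N>
std::uint64_t weighted_sum_mask(const std::array<LaneT, N>& lanes) {
    // The weight for lane i is 1 << i, which must fit inside one lane.
    static_assert(N <= sizeof(LaneT) * 8, "more lanes than bits per lane");
    std::uint64_t sum = 0;
    for (std::size_t i = 0; i < N; ++i) {
        const std::uint64_t weight = std::uint64_t{1} << i;
        // AND keeps only the lane's own weight bit (models the vector AND),
        // the running sum models the horizontal add across lanes.
        sum += static_cast<std::uint64_t>(lanes[i]) & weight;
    }
    return sum;
}
```

For 8 lanes of `uint16_t` this uses the weights 1, 2, 4, ..., 128 from the NEON version above; for 4 lanes of `uint32_t` the weights are 1, 2, 4, 8, and for 2 lanes of `uint64_t` they are 1, 2.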

This mask "trick" only works if the compiler can ensure that the bits of each lane are either all 1 or all 0, which it can after a vector comparison. To demonstrate this, the [godbolt example](https://godbolt.org/z/3jdd9b76d) also contains an x86 version, in which clang generates different instructions based on this knowledge.
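
The all-ones-or-zeros requirement can be seen in a small portable scalar sketch. Both function names are hypothetical, and treating "lane is set" as "lane is nonzero" is an assumption made only for this comparison, not a statement about what `__builtin_convertvector` does for arbitrary inputs.

```cpp
#include <array>
#include <cstddef>
#include <cstdint>

using Lanes = std::array<std::uint16_t, 8>;

// Fast path: AND each lane with its bit weight, then sum all lanes.
// Only correct when every lane is 0x0000 or 0xFFFF.
std::uint8_t mask_by_weighted_sum(const Lanes& lanes) {
    unsigned sum = 0;
    for (std::size_t i = 0; i < lanes.size(); ++i) {
        sum += lanes[i] & (1u << i);
    }
    return static_cast<std::uint8_t>(sum);
}

// Reference (assumed semantics): set bit i whenever lane i is nonzero.
std::uint8_t mask_by_nonzero_test(const Lanes& lanes) {
    unsigned mask = 0;
    for (std::size_t i = 0; i < lanes.size(); ++i) {
        if (lanes[i] != 0) {
            mask |= 1u << i;
        }
    }
    return static_cast<std::uint8_t>(mask);
}
```

For comparison results such as `{0xFFFF, 0, 0xFFFF, 0, 0, 0, 0, 0xFFFF}` both functions agree. For a lane that holds an arbitrary nonzero value, e.g. `2` in lane 0, the weighted sum yields 0 for that lane because `2 & 1 == 0`, while the nonzero test sets bit 0.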

A tiny benchmark on my M1 Pro MacBook shows that the `mask_manual_all_ones_or_zeros` code runs ~4x faster than the version generated by `__builtin_convertvector`. So while the NEON version is not always possible (unless the x86 path is chosen and you compare the vector to itself to ensure equality sets all bits to 1 or 0), it could be applied to improve performance on ARM in a common case.


</pre>