<table border="1" cellspacing="0" cellpadding="8">
    <tr>
        <th>Issue</th>
        <td>
            <a href=https://github.com/llvm/llvm-project/issues/59686>59686</a>
        </td>
    </tr>

    <tr>
        <th>Summary</th>
        <td>
            [aarch64] `bitcast <N x i1> to iN` produces bad assembly
        </td>
    </tr>

    <tr>
      <th>Labels</th>
      <td>
            new issue
      </td>
    </tr>

    <tr>
      <th>Assignees</th>
      <td>
      </td>
    </tr>

    <tr>
      <th>Reporter</th>
      <td>
          Sp00ph
      </td>
    </tr>
</table>

<pre>
    Essentially, trying to recreate the Intel `movemask` intrinsics on aarch64 produces extremely long assembly. Rust's nightly [`simd::Mask::to_bitmask()`](https://doc.rust-lang.org/std/simd/trait.ToBitMask.html#tymethod.to_bitmask) suffers from this.

Given the following IR:
```ll
define i16 @movemask(<16 x i8> %mask) {
    %bits = icmp slt <16 x i8> %mask, zeroinitializer
    %ret = bitcast <16 x i1> %bits to i16
    ret i16 %ret
}
```
On x86-64, it compiles down to just one instruction, as expected:
```asm
movemask:
        pmovmskb        eax, xmm0
        ret
```
On aarch64, however, it takes a whopping 51 instructions to do the same operation:
```asm
movemask:
        sub     sp, sp, #16
        cmlt    v0.16b, v0.16b, #0
        umov    w8, v0.b[1]
        umov    w10, v0.b[2]
        umov    w9, v0.b[0]
        umov    w11, v0.b[3]
        umov    w12, v0.b[4]
        umov    w13, v0.b[5]
        and     w8, w8, #0x1
        and     w10, w10, #0x1
        and     w9, w9, #0x1
        and     w11, w11, #0x1
        and     w12, w12, #0x1
        and     w13, w13, #0x1
        bfi     w9, w8, #1, #1
        umov    w8, v0.b[6]
        bfi     w9, w10, #2, #1
        umov    w10, v0.b[7]
        bfi     w9, w11, #3, #1
        umov    w11, v0.b[8]
        bfi     w9, w12, #4, #1
        umov    w12, v0.b[9]
        and     w8, w8, #0x1
        bfi     w9, w13, #5, #1
        umov    w13, v0.b[10]
        and     w10, w10, #0x1
        orr     w8, w9, w8, lsl #6
        umov    w9, v0.b[11]
        and     w11, w11, #0x1
        orr     w8, w8, w10, lsl #7
        umov    w10, v0.b[12]
        and     w12, w12, #0x1
        orr     w8, w8, w11, lsl #8
        umov    w11, v0.b[13]
        and     w13, w13, #0x1
        orr     w8, w8, w12, lsl #9
        umov    w12, v0.b[14]
        and     w9, w9, #0x1
        orr     w8, w8, w13, lsl #10
        and     w10, w10, #0x1
        orr     w8, w8, w9, lsl #11
        and     w9, w11, #0x1
        umov    w11, v0.b[15]
        orr     w8, w8, w10, lsl #12
        and     w10, w12, #0x1
        orr     w8, w8, w9, lsl #13
        orr     w8, w8, w10, lsl #14
        orr     w8, w8, w11, lsl #15
        and     w0, w8, #0xffff
        add     sp, sp, #16
        ret
```

aarch64 doesn't have a `movemask` instruction like x86-64 does, but it's possible to simulate its behavior using far fewer instructions, e.g. like so:

```asm
movemask:
        ushr    v0.16b, v0.16b, #7
        usra    v0.8h, v0.8h, #7
        usra    v0.4s, v0.4s, #14
        usra    v0.2d, v0.2d, #28
        umov    w0, v0.b[0]
        umov    w8, v0.b[8]
        bfi     w0, w8, #8, #24
        ret
```
(compiled from https://stackoverflow.com/a/58381188 )
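
To illustrate why the shift-and-accumulate sequence above yields the same bits as the `bitcast`, here is a rough scalar model in Python. It is only a sketch and not part of LLVM or of the linked answer: the function names (`movemask_reference`, `fold_half`, `movemask_shift_accumulate`) are made up for illustration, and each 64-bit half of the vector is modelled as a plain integer, with masks standing in for the per-lane shifts (assuming little-endian lane order, so `v0.b[0]` is the lowest byte).

```python
def movemask_reference(data: bytes) -> int:
    """Bit i of the result is the most significant bit of byte i."""
    mask = 0
    for i, b in enumerate(data):
        mask |= (b >> 7) << i
    return mask


def fold_half(half: int) -> int:
    """Scalar model of the four shift steps on one 64-bit half of the vector."""
    # ushr v0.16b, #7: reduce every byte to its top bit (0 or 1).
    x = (half >> 7) & 0x0101010101010101
    # usra v0.8h, #7: per 16-bit lane, lane += lane >> 7.
    # A 64-bit shift leaks bits from the lane above, so mask them off,
    # as a true per-lane shift would. Low byte of each lane now holds 2 bits.
    x += (x >> 7) & 0x01FF01FF01FF01FF
    # usra v0.4s, #14: per 32-bit lane, lane += lane >> 14 (4 bits per low byte).
    x += (x >> 14) & 0x0003FFFF0003FFFF
    # usra v0.2d, #28: the whole 64-bit lane, so no mask is needed.
    x += x >> 28
    # The added bits never overlap existing ones, so no carries occur and
    # the low byte ends up holding all 8 sign bits in order.
    return x & 0xFF


def movemask_shift_accumulate(data: bytes) -> int:
    """Combine both halves, like the two umov plus the bfi."""
    assert len(data) == 16
    low = fold_half(int.from_bytes(data[:8], "little"))
    high = fold_half(int.from_bytes(data[8:], "little"))
    return low | (high << 8)
```

For example, `movemask_shift_accumulate(bytes([0x80] * 16))` gives `0xFFFF`, the same value as `movemask_reference` for that input.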

I'm not at all familiar with codegen, but I would hope that it's possible to use some clever algorithm to create assembly that's closer to optimal for all vector lengths.


</pre>