<table border="1" cellspacing="0" cellpadding="8">

    <tr>

        <th>Issue</th>

        <td>

            <a href=https://github.com/llvm/llvm-project/issues/59686>59686</a>

        </td>

    </tr>

    <tr>

        <th>Summary</th>

        <td>

            [aarch64] `bitcast <N x i1> to iN` produces bad assembly

        </td>

    </tr>

    <tr>

      <th>Labels</th>

      <td>

            new issue

      </td>

    </tr>

    <tr>

      <th>Assignees</th>

      <td>

      </td>

    </tr>

    <tr>

      <th>Reporter</th>

      <td>

          Sp00ph

      </td>

    </tr>

</table>

<pre>

Recreating the Intel `movemask` intrinsics on aarch64 produces extremely long assembly. Rust's nightly [`simd::Mask::to_bitmask()`](https://doc.rust-lang.org/std/simd/trait.ToBitMask.html#tymethod.to_bitmask) is affected by this.

Given the following IR:

```ll

define i16 @movemask(<16 x i8> %mask) {

    %bits = icmp slt <16 x i8> %mask, zeroinitializer

    %ret = bitcast <16 x i1> %bits to i16

    ret i16 %ret

}

```
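
To make the intended semantics concrete, here is a minimal scalar sketch in Rust of what this IR computes. It assumes a little-endian target, where vector element 0 lands in bit 0 of the `i16`. The function name `movemask_reference` is made up for illustration and is not part of any library.

```rust
/// Scalar reference for the IR above: bit `i` of the result is the sign
/// bit of lane `i`. Assumes little-endian lane order (element 0 -> bit 0).
/// `movemask_reference` is a hypothetical name used only for illustration.
pub fn movemask_reference(lanes: [i8; 16]) -> u16 {
    let mut mask = 0u16;
    for (i, &lane) in lanes.iter().enumerate() {
        // `icmp slt lane, 0` yields an i1 that is set exactly when the lane is negative.
        if lane < 0 {
            mask |= 1 << i;
        }
    }
    mask
}
```

Only the sign bit of each lane matters here; the remaining seven bits are ignored.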

On x86-64, it compiles to a single `pmovmskb` plus the `ret`, as expected:

```asm

movemask:

        pmovmskb        eax, xmm0

        ret

```

On aarch64, however, the same operation takes 51 instructions, including the stack adjustment and `ret`:

```asm

movemask:

        sub     sp, sp, #16

        cmlt    v0.16b, v0.16b, #0

        umov    w8, v0.b[1]

        umov    w10, v0.b[2]

        umov    w9, v0.b[0]

        umov    w11, v0.b[3]

        umov    w12, v0.b[4]

        umov    w13, v0.b[5]

        and     w8, w8, #0x1

        and     w10, w10, #0x1

        and     w9, w9, #0x1

        and     w11, w11, #0x1

        and     w12, w12, #0x1

        and     w13, w13, #0x1

        bfi     w9, w8, #1, #1

        umov    w8, v0.b[6]

        bfi     w9, w10, #2, #1

        umov    w10, v0.b[7]

        bfi     w9, w11, #3, #1

        umov    w11, v0.b[8]

        bfi     w9, w12, #4, #1

        umov    w12, v0.b[9]

        and     w8, w8, #0x1

        bfi     w9, w13, #5, #1

        umov    w13, v0.b[10]

        and     w10, w10, #0x1

        orr     w8, w9, w8, lsl #6

        umov    w9, v0.b[11]

        and     w11, w11, #0x1

        orr     w8, w8, w10, lsl #7

        umov    w10, v0.b[12]

        and     w12, w12, #0x1

        orr     w8, w8, w11, lsl #8

        umov    w11, v0.b[13]

        and     w13, w13, #0x1

        orr     w8, w8, w12, lsl #9

        umov    w12, v0.b[14]

        and     w9, w9, #0x1

        orr     w8, w8, w13, lsl #10

        and     w10, w10, #0x1

        orr     w8, w8, w9, lsl #11

        and     w9, w11, #0x1

        umov    w11, v0.b[15]

        orr     w8, w8, w10, lsl #12

        and     w10, w12, #0x1

        orr     w8, w8, w9, lsl #13

        orr     w8, w8, w10, lsl #14

        orr     w8, w8, w11, lsl #15

        and     w0, w8, #0xffff

        add     sp, sp, #16

        ret

```

aarch64 has no equivalent of x86-64's `pmovmskb`, but the same behavior can be emulated with far fewer instructions, for example:

```asm

movemask:

        ushr    v0.16b, v0.16b, #7

        usra    v0.8h, v0.8h, #7

        usra    v0.4s, v0.4s, #14

        usra    v0.2d, v0.2d, #28

        umov    w0, v0.b[0]

        umov    w8, v0.b[8]

        bfi     w0, w8, #8, #24

        ret

```

(compiled from https://stackoverflow.com/a/58381188 )
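
To see why the shorter sequence works, a portable Rust model of its lane arithmetic may help. This is only a sketch: it assumes little-endian lane layout, models `usra` as a wrapping per-lane add of the shifted value, and uses a made-up function name. It is not NEON code and makes no claim about what LLVM should emit.

```rust
/// Portable model of the `ushr`/`usra` sequence above (not NEON code).
/// Hypothetical helper for illustration; assumes little-endian lane layout.
pub fn movemask_shift_accumulate(bytes: [u8; 16]) -> u16 {
    // ushr v0.16b, v0.16b, #7: every byte becomes 0 or 1 (its old sign bit).
    let b = bytes.map(|x| x >> 7);

    // View as 8 x u16, then usra v0.8h, v0.8h, #7: lane += lane >> 7.
    // The low byte of each lane now holds 2 mask bits.
    let mut h = [0u16; 8];
    for i in 0..8 {
        let x = u16::from_le_bytes([b[2 * i], b[2 * i + 1]]);
        h[i] = x.wrapping_add(x >> 7);
    }

    // View as 4 x u32, then usra v0.4s, v0.4s, #14: low byte holds 4 mask bits.
    let mut s = [0u32; 4];
    for i in 0..4 {
        let x = u32::from(h[2 * i]) | (u32::from(h[2 * i + 1]) << 16);
        s[i] = x.wrapping_add(x >> 14);
    }

    // View as 2 x u64, then usra v0.2d, v0.2d, #28: low byte holds 8 mask bits.
    let mut d = [0u64; 2];
    for i in 0..2 {
        let x = u64::from(s[2 * i]) | (u64::from(s[2 * i + 1]) << 32);
        d[i] = x.wrapping_add(x >> 28);
    }

    // umov w0, v0.b[0]; umov w8, v0.b[8]; bfi w0, w8, #8, #24.
    // Only the low 16 bits of the result are meaningful for an i16 return.
    let lo = u16::from(d[0] as u8);
    let hi = u16::from(d[1] as u8);
    lo | (hi << 8)
}
```

Each accumulate step doubles the number of mask bits gathered in the low byte of a lane. The shifted bits never collide with bits already present, so the additions behave like bitwise ORs.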

I'm not at all familiar with codegen, but I would hope that it's possible to use some clever algorithm to create assembly that's closer to optimal for all vector lengths.

</pre>
