<table border="1" cellspacing="0" cellpadding="8">
    <tr>
        <th>Issue</th>
        <td>
            <a href=https://github.com/llvm/llvm-project/issues/95860>95860</a>
        </td>
    </tr>

    <tr>
        <th>Summary</th>
        <td>
            AARCH64: Non-SVE popcount autovect for 32bit and 64 bit could be improved using v8.4-a's udot instruction
        </td>
    </tr>

    <tr>
      <th>Labels</th>
      <td>
            new issue
      </td>
    </tr>

    <tr>
      <th>Assignees</th>
      <td>
      </td>
    </tr>

    <tr>
      <th>Reporter</th>
      <td>
          pinskia
      </td>
    </tr>
</table>

<pre>
    Take:
```
void f(int *a, int b)
{
    for (int i = 0; i < b; i++)
        a[i] = __builtin_popcount(a[i]);
}
```

Currently LLVM produces (for -O2 -march=armv8.4-a  -fno-unroll-loops):
```
.LBB0_4: // =>This Inner Loop Header: Depth=1
        ldr     q0, [x11]
        subs    x10, x10, #4
        cnt     v0.16b, v0.16b
        uaddlp  v0.8h, v0.16b
        uaddlp  v0.4s, v0.8h
        str     q0, [x11], #16
        b.ne    .LBB0_4
```

But this could be improved to:
```
        movi    v1.16b, #1
.LBB0_4: // =>This Inner Loop Header: Depth=1
        ldr     q0, [x11]
        subs    x10, x10, #4
        movi    v2.4s, #0
        cnt     v0.16b, v0.16b
        udot    v2.4s, v0.16b, v1.16b
        str     q2, [x11], #16
        b.ne    .LBB0_4
```

That is, generate:
```
movi    v1.16b, #1
movi    v2.4s, #0
cnt     v0.16b, v0.16b
udot    v2.4s, v0.16b, v1.16b
```
This sequence is one instruction longer (four instead of three), but it will pipeline better.
The 64-bit case still needs the final `uaddlp`.
This came up during the review of the GCC patch, and I thought it would be good to file it here too.
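
For reference, here is a rough scalar model of why `cnt` followed by `udot` against an all-ones vector gives the per-lane popcount. It is only a sketch: the helper names (`cnt_byte`, `popcount32_via_dot`, `popcount64_via_dot`) are made up for illustration, and it assumes `unsigned` is 32 bits and `unsigned long long` is 64 bits.
```
/* Hypothetical helper: models `cnt` on a single byte (set bits in that byte). */
static unsigned cnt_byte(unsigned char x)
{
    unsigned n = 0;
    while (x) {
        n += x & 1u;
        x >>= 1;
    }
    return n;
}

/* Models one 32-bit lane of `cnt` + `udot` with an all-ones multiplier:
   the four per-byte counts are each multiplied by 1 and summed into a
   zeroed 32-bit accumulator. Assumes `unsigned` is 32 bits wide. */
static unsigned popcount32_via_dot(unsigned v)
{
    unsigned acc = 0;                           /* movi v2.4s, #0 */
    for (int i = 0; i < 4; i++) {
        unsigned char byte = (unsigned char)(v >> (8 * i));
        acc += cnt_byte(byte) * 1u;             /* udot against movi v1.16b, #1 */
    }
    return acc;
}

/* Models the 64-bit case: two adjacent 32-bit lane sums are combined by
   one pairwise widening add (the remaining `uaddlp`). */
static unsigned long long popcount64_via_dot(unsigned long long v)
{
    unsigned lo = popcount32_via_dot((unsigned)(v & 0xFFFFFFFFull));
    unsigned hi = popcount32_via_dot((unsigned)(v >> 32));
    return (unsigned long long)lo + hi;
}
```
Checking `popcount32_via_dot(x)` against `__builtin_popcount(x)` for a few values is a quick way to convince yourself the two sequences agree.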
</pre>