<table border="1" cellspacing="0" cellpadding="8">
    <tr>
        <th>Issue</th>
        <td>
            <a href="https://github.com/llvm/llvm-project/issues/119386">119386</a>
        </td>
    </tr>

    <tr>
        <th>Summary</th>
        <td>
            x264 performance regression since 19.1.5 with rva22u64_v
        </td>
    </tr>

    <tr>
      <th>Labels</th>
      <td>
            backend:RISC-V
      </td>
    </tr>

    <tr>
      <th>Assignees</th>
      <td>
      </td>
    </tr>

    <tr>
      <th>Reporter</th>
      <td>
          lukel97
      </td>
    </tr>
</table>

<pre>
    525.x264_r compiled with -O3 -flto -march=rva22u64_v is 15% slower compared to 19.1.5 on the spacemit-x60: https://lnt.lukelau.me/db_default/v4/nts/70?compare_to=69

The regression was introduced in #105858, which caused SLP to start vectorizing the horizontal reductions in `x264_pixel_satd_*`.

These end up producing several m4 vrgather.vv instructions, which are quadratically expensive. If you view the profiles on LNT, you'll find that `x264_pixel_satd_8x4` gets inlined and then vectorized in e.g. `x264_pixel_satd_16x16`, and you will see spikes in the u_mode_cycle counter around the vrgathers.

From what I can tell, #105858 isn't doing anything wrong; it just happens to improve the cost model enough that it triggers the unprofitable vectorization. It is correctly adding the quadratic cost of the several two-source permutation shuffles.

We could try to disable the vectorization again by tweaking the cost of vrgather.vv to be more expensive: at LMUL 4 we currently cost it at `1*4*4=16`, but on the spacemit-x60 it is closer to 64, because at LMUL 1 the reciprocal throughput is 4 according to https://camel-cdr.github.io/rvv-bench-results/bpi_f3/index.html. I'm not sure whether this holds for other microarchitectures.

Alternatively, we could try to improve the codegen. These functions look like they could eventually be vectorized profitably, given the relatively large VF=16.

One idea is that one of the vrgather.vvs comes from this shuffle:

```llvm
%52 = shufflevector <16 x i32> %51, <16 x i32> poison, <16 x i32> <i32 3, i32 2, i32 1, i32 0, i32 7, i32 6, i32 5, i32 4, i32 11, i32 10, i32 9, i32 8, i32 15, i32 14, i32 13, i32 12>
```

This could be done as an adjacent element swap at e32, followed by a second one at e64, i.e.:

```
idx    e32 swap    e64 swap
 0  ->   1     ->    3
 1  ->   0     ->    2
 2  ->   3     ->    1
 3  ->   2     ->    0
 4  ->   5     ->    7
 5  ->   4     ->    6
 6  ->   7     ->    5
 7  ->   6     ->    4
 8  ->   9     ->   11
 9  ->   8     ->   10
10  ->  11     ->    9
11  ->  10     ->    8
12  ->  13     ->   15
13  ->  12     ->   14
14  ->  15     ->   13
15  ->  14     ->   12
```

We can do an adjacent element swap at e32 with vror, and at e64 with a vslide1up + masked vslide1down. 

Finally, it's worth noting that GCC also vectorizes this, but as several m1 vrgather.vvs, which is probably profitable because they're not quadratic: https://godbolt.org/z/av4vGqjsG. Is splitting the reduction into several m1-sized sub-trees something SLP could feasibly do? cc @alexey-bataev

The extracted kernel is below:

```c
#include <stdint.h>
#include <math.h>

// in: a pseudo-simd number of the form x+(y<<16)
// return: abs(x)+(abs(y)<<16)
static inline uint32_t abs2( uint32_t a )
{
    uint32_t s = ((a>>15)&0x10001)*0xffff;
    return (a+s)^s;
}

#define HADAMARD4(d0, d1, d2, d3, s0, s1, s2, s3) {\
    int t0 = s0 + s1;\
    int t1 = s0 - s1;\
    int t2 = s2 + s3;\
    int t3 = s2 - s3;\
    d0 = t0 + t2;\
    d2 = t0 - t2;\
    d1 = t1 + t3;\
    d3 = t1 - t3;\
}

int x264_pixel_satd_8x4( uint8_t *pix1, int i_pix1, uint8_t *pix2, int i_pix2 )
{
    uint32_t tmp[4][4];
    uint32_t a0, a1, a2, a3;
    int sum = 0;
    for( int i = 0; i < 4; i++, pix1 += i_pix1, pix2 += i_pix2 )
    {
        a0 = (pix1[0] - pix2[0]) + ((pix1[4] - pix2[4]) << 16);
        a1 = (pix1[1] - pix2[1]) + ((pix1[5] - pix2[5]) << 16);
        a2 = (pix1[2] - pix2[2]) + ((pix1[6] - pix2[6]) << 16);
        a3 = (pix1[3] - pix2[3]) + ((pix1[7] - pix2[7]) << 16);
        HADAMARD4( tmp[i][0], tmp[i][1], tmp[i][2], tmp[i][3], a0,a1,a2,a3 );
    }
    for( int i = 0; i < 4; i++ )
    {
        HADAMARD4( a0, a1, a2, a3, tmp[0][i], tmp[1][i], tmp[2][i], tmp[3][i] );
        sum += abs2(a0) + abs2(a1) + abs2(a2) + abs2(a3);
    }
    return (((uint16_t)sum) + ((uint32_t)sum>>16)) >> 1;
}
```
</pre>