<table border="1" cellspacing="0" cellpadding="8">
    <tr>
        <th>Issue</th>
        <td>
            <a href=https://github.com/llvm/llvm-project/issues/57210>57210</a>
        </td>
    </tr>

    <tr>
        <th>Summary</th>
        <td>
            [RISC-V] Auto-Vectorization Generates Epilogue Scalar Code Rather Than Stripmine
        </td>
    </tr>

    <tr>
      <th>Labels</th>
      <td>
            new issue
      </td>
    </tr>

    <tr>
      <th>Assignees</th>
      <td>
      </td>
    </tr>

    <tr>
      <th>Reporter</th>
      <td>
          lidawei1226
      </td>
    </tr>
</table>

<pre>
    A simple add loop:
```
void add(int n, int* restrict x, int* restrict y)
{
    for (int i = 0; i < n; i++) {
        x[i] = y[i] + 10;
    }
}
```
compiled with command:
`clang --gcc-toolchain=/opt/riscv -march=rv64gcv -menable-experimental-extensions -mllvm --riscv-v-vector-bits-min=128 -O3 -S loop.c`
generates the following assembly:
```
        .globl  add                             # -- Begin function add
        .p2align        1
        .type   add,@function
add:                                    # @add
# %bb.0:
        blez    a0, .LBB0_8
# %bb.1:
        li      a3, 8
        bgeu    a0, a3, .LBB0_3
# %bb.2:
        li      a6, 0
        j       .LBB0_6
.LBB0_3:
        andi    a6, a0, -8
        vsetivli        zero, 4, e32, m1, ta, mu
        mv      a4, a6
        mv      a5, a1
        mv      a3, a2
.LBB0_4:                                # =>This Inner Loop Header: Depth=1
        addi    a7, a3, 16
        vle32.v v8, (a3)
        vle32.v v9, (a7)
        addi    a7, a5, 16
        vadd.vi v8, v8, 10
        vadd.vi v9, v9, 10
        vse32.v v8, (a5)
        vse32.v v9, (a7)
        addi    a3, a3, 32
        addi    a4, a4, -8
        addi    a5, a5, 32
        bnez    a4, .LBB0_4
# %bb.5:
        beq     a6, a0, .LBB0_8
.LBB0_6:
        slli    a3, a6, 2
        add     a1, a1, a3
        add     a2, a2, a3
        sub     a0, a0, a6
.LBB0_7:                                # =>This Inner Loop Header: Depth=1
        lw      a3, 0(a2)
        addiw   a3, a3, 10
        sw      a3, 0(a1)
        addi    a1, a1, 4
        addi    a0, a0, -1
        addi    a2, a2, 4
        bnez    a0, .LBB0_7
.LBB0_8:
        ret
.Lfunc_end0:
        .size   add, .Lfunc_end0-add
                                        # -- End function
```
where `.LBB0_7` is the scalar epilogue loop, which slows down execution and increases code size. Note also that the vectorizer interleaved two LMUL=1 operations (`v8` and `v9`) instead of using a larger LMUL.
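For reference, the control flow of the generated assembly can be sketched in plain C roughly as follows. This is only my reading of the listing, not compiler output: the function name `add_with_epilogue` and the variable `main_count` are made up for illustration, and the two inner 4-element loops stand in for the two interleaved vector operations.

```c
/* Hypothetical C model of the generated code: an 8-element main loop
 * (two interleaved groups of 4) followed by a scalar remainder loop. */
void add_with_epilogue(int n, int *restrict x, const int *restrict y)
{
    if (n <= 0)                          /* blez a0, .LBB0_8 */
        return;

    int i = 0;
    if (n >= 8) {                        /* bgeu a0, a3, .LBB0_3 */
        int main_count = n & -8;         /* andi a6, a0, -8 */
        for (; i < main_count; i += 8) { /* .LBB0_4 */
            for (int j = 0; j < 4; j++)  /* the v8 group */
                x[i + j] = y[i + j] + 10;
            for (int j = 4; j < 8; j++)  /* the v9 group */
                x[i + j] = y[i + j] + 10;
        }
    }

    for (; i < n; i++)                   /* .LBB0_7: scalar epilogue */
        x[i] = y[i] + 10;
}
```

With `n = 19`, for example, the main loop covers elements 0 to 15 and the remaining 3 elements go through the scalar loop one at a time.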
From my perspective, the optimal code using a stripmining approach would look something like this:
```
.LBB0_4:
        # a4: number of elements remaining (initially n)
        # a7: vl for this iteration, then reused as the byte offset
        # m8: LMUL=8
        # ta: tail agnostic
        # mu: mask undisturbed
        vsetvli     a7, a4, e32, m8, ta, mu
        vle32.v     v8, (a3)
        sub         a4, a4, a7
        slli        a7, a7, 2
        add         a3, a3, a7
        vadd.vi     v8, v8, 10
        vse32.v     v8, (a5)
        add         a5, a5, a7
        bnez        a4, .LBB0_4
```
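To make the intended loop structure concrete, here is a minimal C model of the stripmined loop. It is a sketch under assumptions, not real vector code: `MODEL_VLMAX` and `model_vsetvl` are hypothetical stand-ins for the hardware, the value 32 assumes VLEN=128 at e32/m8, and the vector length is modeled as min(remaining, VLMAX), which as far as I understand is one of the behaviors the V spec permits.

```c
/* Assumed stand-in for VLMAX at e32/m8 with VLEN=128: (128 / 32) * 8 = 32. */
enum { MODEL_VLMAX = 32 };

/* Hypothetical model of the vector-length request: vl = min(avl, VLMAX). */
static int model_vsetvl(int avl)
{
    return avl > MODEL_VLMAX ? MODEL_VLMAX : avl;
}

void add_stripmined(int n, int *restrict x, const int *restrict y)
{
    int remaining = n;                     /* a4 */
    while (remaining > 0) {
        int vl = model_vsetvl(remaining);  /* a7 = vl for this pass */
        for (int i = 0; i < vl; i++)       /* one vector load, add, store */
            x[i] = y[i] + 10;
        x += vl;                           /* advance a5 */
        y += vl;                           /* advance a3 */
        remaining -= vl;                   /* sub a4, a4, a7 */
    }
}
```

The last pass simply runs with a shorter `vl`, so no separate scalar remainder loop is needed. The one difference from the assembly above is that this model also checks `remaining > 0` before the first pass.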
What is missing in the vectorizer?
</pre>