<table border="1" cellspacing="0" cellpadding="8">
<tr>
<th>Issue</th>
<td>
<a href=https://github.com/llvm/llvm-project/issues/146407>146407</a>
</td>
</tr>
<tr>
<th>Summary</th>
<td>
[RISCV] Code Size Increase on SPEC with -mcpu=spacemit-x60 caused by PR 144564
</td>
</tr>
<tr>
<th>Labels</th>
<td>
new issue
</td>
</tr>
<tr>
<th>Assignees</th>
<td>
</td>
</tr>
<tr>
<th>Reporter</th>
<td>
mikhailramalho
</td>
</tr>
</table>
<pre>
This issue tracks the code size regression caused by the scheduling model changes in PR #144564.
The results: https://lnt.lukelau.me/db_default/v4/nts/674?compare_to=673
After updating the SpacemiT X60 scheduling model with hardware-measured latencies, code size increased because of extra vector register spills. The following reduced test from Blender shows the issue:
```
target datalayout = "e-m:e-p:64:64-i64:64-i128:128-n32:64-S128"
target triple = "riscv64-unknown-linux-gnu"
define fastcc <vscale x 64 x i8> @do_cross_effect_byte(ptr %rect2, i16 %0, i16 %1, <vscale x 16 x i16> %2) {
entry:
%3 = insertelement <vscale x 16 x i16> zeroinitializer, i16 %1, i64 0
%4 = shufflevector <vscale x 16 x i16> %3, <vscale x 16 x i16> zeroinitializer, <vscale x 16 x i32> zeroinitializer
%5 = insertelement <vscale x 16 x i16> zeroinitializer, i16 %0, i64 0
%strided.vec142 = tail call { <vscale x 16 x i8>, <vscale x 16 x i8>, <vscale x 16 x i8>, <vscale x 16 x i8> } @llvm.vector.deinterleave4.nxv64i8(<vscale x 64 x i8> zeroinitializer)
%6 = extractvalue { <vscale x 16 x i8>, <vscale x 16 x i8>, <vscale x 16 x i8>, <vscale x 16 x i8> } %strided.vec142, 0
%7 = extractvalue { <vscale x 16 x i8>, <vscale x 16 x i8>, <vscale x 16 x i8>, <vscale x 16 x i8> } %strided.vec142, 1
%8 = extractvalue { <vscale x 16 x i8>, <vscale x 16 x i8>, <vscale x 16 x i8>, <vscale x 16 x i8> } %strided.vec142, 2
%9 = extractvalue { <vscale x 16 x i8>, <vscale x 16 x i8>, <vscale x 16 x i8>, <vscale x 16 x i8> } %strided.vec142, 3
%10 = zext <vscale x 16 x i8> %6 to <vscale x 16 x i16>
%11 = mul <vscale x 16 x i16> %4, %10
%wide.vec143 = load <vscale x 64 x i8>, ptr %rect2, align 1
%strided.vec144 = tail call { <vscale x 16 x i8>, <vscale x 16 x i8>, <vscale x 16 x i8>, <vscale x 16 x i8> } @llvm.vector.deinterleave4.nxv64i8(<vscale x 64 x i8> %wide.vec143)
%12 = extractvalue { <vscale x 16 x i8>, <vscale x 16 x i8>, <vscale x 16 x i8>, <vscale x 16 x i8> } %strided.vec144, 0
%13 = extractvalue { <vscale x 16 x i8>, <vscale x 16 x i8>, <vscale x 16 x i8>, <vscale x 16 x i8> } %strided.vec144, 1
%14 = extractvalue { <vscale x 16 x i8>, <vscale x 16 x i8>, <vscale x 16 x i8>, <vscale x 16 x i8> } %strided.vec144, 2
%15 = extractvalue { <vscale x 16 x i8>, <vscale x 16 x i8>, <vscale x 16 x i8>, <vscale x 16 x i8> } %strided.vec144, 3
%16 = zext <vscale x 16 x i8> %12 to <vscale x 16 x i16>
%17 = mul <vscale x 16 x i16> %2, %16
%18 = add <vscale x 16 x i16> %17, %11
%19 = trunc <vscale x 16 x i16> %18 to <vscale x 16 x i8>
%20 = zext <vscale x 16 x i8> %7 to <vscale x 16 x i16>
%21 = mul <vscale x 16 x i16> %4, %20
%22 = zext <vscale x 16 x i8> %13 to <vscale x 16 x i16>
%23 = mul <vscale x 16 x i16> %2, %22
%24 = add <vscale x 16 x i16> %23, %21
%25 = trunc <vscale x 16 x i16> %24 to <vscale x 16 x i8>
%26 = zext <vscale x 16 x i8> %8 to <vscale x 16 x i16>
%27 = mul <vscale x 16 x i16> %4, %26
%28 = zext <vscale x 16 x i8> %14 to <vscale x 16 x i16>
%29 = mul <vscale x 16 x i16> %2, %28
%30 = add <vscale x 16 x i16> %29, %27
%31 = trunc <vscale x 16 x i16> %30 to <vscale x 16 x i8>
%32 = zext <vscale x 16 x i8> %9 to <vscale x 16 x i16>
%33 = mul <vscale x 16 x i16> %4, %32
%34 = zext <vscale x 16 x i8> %15 to <vscale x 16 x i16>
%35 = mul <vscale x 16 x i16> %5, %34
%36 = add <vscale x 16 x i16> %35, %33
%37 = trunc <vscale x 16 x i16> %36 to <vscale x 16 x i8>
%interleaved.vec145 = tail call <vscale x 64 x i8> @llvm.vector.interleave4.nxv64i8(<vscale x 16 x i8> %19, <vscale x 16 x i8> %25, <vscale x 16 x i8> %31, <vscale x 16 x i8> %37)
ret <vscale x 64 x i8> %interleaved.vec145
}
; Function Attrs: nocallback nofree nosync nounwind willreturn memory(none)
declare { <vscale x 16 x i8>, <vscale x 16 x i8>, <vscale x 16 x i8>, <vscale x 16 x i8> } @llvm.vector.deinterleave4.nxv64i8(<vscale x 64 x i8>) #0
; Function Attrs: nocallback nofree nosync nounwind willreturn memory(none)
declare <vscale x 64 x i8> @llvm.vector.interleave4.nxv64i8(<vscale x 16 x i8>, <vscale x 16 x i8>, <vscale x 16 x i8>, <vscale x 16 x i8>) #0
; uselistorder directives
uselistorder ptr @llvm.vector.deinterleave4.nxv64i8, { 1, 0 }
attributes #0 = { nocallback nofree nosync nounwind willreturn memory(none) }
```
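After the patch, we start to see extra `vs4r.v` and `vl4r.v` instructions (whole-register vector stores and loads, i.e. spills and reloads) in the generated code.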
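To compare the two versions of the generated assembly, a small script can count the whole-register vector loads and stores. The following is only a rough sketch and not part of the original investigation: the function name `count_whole_reg_moves` is hypothetical, and the count is a proxy, since whole-register loads and stores can also come from ordinary loads of full vector registers rather than from spills.

```python
import re
from collections import Counter

# Whole-register vector stores (vs<N>r.v) and loads (vl<N>r.v or
# vl<N>re<EEW>.v). RVV register spills and reloads usually appear in
# this form, but so can ordinary whole-register loads and stores.
WHOLE_REG_RE = re.compile(
    r"^\s*(vs[1248]r\.v|vl[1248]r(?:e(?:8|16|32|64))?\.v)\b"
)


def count_whole_reg_moves(asm_text: str) -> Counter:
    """Count whole-register vector loads/stores per mnemonic."""
    counts = Counter()
    for line in asm_text.splitlines():
        match = WHOLE_REG_RE.match(line)
        if match:
            counts[match.group(1)] += 1
    return counts
```

Running it on the assembly produced for the reduced test before and after the patch (for example, the output of `llc -mcpu=spacemit-x60`) and subtracting the two counters should isolate the extra `vs4r.v` and `vl4r.v` instructions.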
While debugging the issue, I tried a couple of things:
1. Increasing the latency of vector load/store instructions: **it didn't fix the issue**. Even latencies as high as 1000 cycles don't change the final generated code.
2. Adding `ReleaseAtCycles`, scaled with LMUL, to all vector integer instructions: **it fixes the issue**. I'm still investigating why. I got this idea from looking at the P400 scheduling model.
</pre>