<table border="1" cellspacing="0" cellpadding="8">
<tr>
<th>Issue</th>
<td>
<a href=https://github.com/llvm/llvm-project/issues/139198>139198</a>
</td>
</tr>
<tr>
<th>Summary</th>
<td>
AMDGPU missed opportunity: 2 x v_mov_b32 -> v_mov_b64
</td>
</tr>
<tr>
<th>Labels</th>
<td>
new issue
</td>
</tr>
<tr>
<th>Assignees</th>
<td>
</td>
</tr>
<tr>
<th>Reporter</th>
<td>
newling
</td>
</tr>
</table>
<pre>
Consider the following IR (input.ll). In particular, I'm focusing on what `zeroinitializer` gets lowered to by llc.
```
define amdgpu_kernel void @main(ptr addrspace(1) %out_ptr) {
entry:
br label %loop
loop: ; preds = %loop, %entry
%vec = phi <32 x float> [ zeroinitializer, %entry ], [ %vec.next, %loop ]
%i = phi i32 [ 0, %entry ], [ %i.next, %loop ]
%element = extractelement <32 x float> %vec, i32 %i
%add = fadd float %element, 1.0
%vec.next = insertelement <32 x float> %vec, float %add, i32 %i
%i.next = add nuw nsw i32 %i, 1
%exitcond = icmp eq i32 %i.next, 16
br i1 %exitcond, label %store_result, label %loop
store_result: ; preds = %loop
%ptr = getelementptr float, ptr addrspace(1) %out_ptr, i64 0
store <32 x float> %vec.next, ptr addrspace(1) %ptr, align 64
ret void
}
```
Running
```
llc -mtriple=amdgcn -mcpu=gfx942 input.ll
```
the generated assembly is:
```
.text
.globl main ; -- Begin function main
.p2align 8
.type main,@function
main: ; @main
; %bb.0: ; %entry
v_mov_b32_e32 v0, 0
s_mov_b32 s0, 0
v_mov_b32_e32 v1, v0
v_mov_b32_e32 v2, v0
v_mov_b32_e32 v3, v0
v_mov_b32_e32 v4, v0
v_mov_b32_e32 v5, v0
[...]
v_mov_b32_e32 v26, v0
v_mov_b32_e32 v27, v0
v_mov_b32_e32 v28, v0
v_mov_b32_e32 v29, v0
v_mov_b32_e32 v30, v0
v_mov_b32_e32 v31, v0
.LBB0_1: ; %loop
; =>This Inner Loop Header: Depth=1
s_set_gpr_idx_on s0, gpr_idx(SRC0)
v_mov_b32_e32 v32, v0
s_set_gpr_idx_off
v_add_f32_e32 v32, 1.0, v32
s_set_gpr_idx_on s0, gpr_idx(DST)
```
The optimization I have in mind is to combine consecutive v_mov_b32_e32 instructions, to arrive at something like
```
[...]
v_mov_b64_e32 v[2:3], v[0:1]
v_mov_b64_e32 v[4:5], v[0:1]
v_mov_b64_e32 v[6:7], v[0:1]
[...]
v_mov_b64_e32 v[30:31], v[0:1]
```
making use of the 2-register move instruction available on MI300 (search for "V_MOV_B64" in https://www.amd.com/content/dam/amd/en/documents/instinct-tech-docs/instruction-set-architectures/amd-instinct-mi300-cdna3-instruction-set-architecture.pdf).
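To make the idea concrete, here is a rough sketch of the pairing rule as a toy peephole in Python. It is not LLVM code: `combine_movs`, the `(dst, src)` tuple encoding and the `copy_of` map are all hypothetical names for illustration only. It assumes that 64-bit VGPR pairs must start at an even register, and it only fuses when the upper half of the source pair is already known to hold a copy of the lower half (here, v1 being a copy of v0).

```python
from typing import Dict, List, Tuple

# Toy encoding (hypothetical): (dst, src) stands for "v_mov_b32_e32 v<dst>, v<src>".
Mov = Tuple[int, int]


def _record_copy(copy_of: Dict[int, int], dst: int, src: int) -> None:
    """Note that v<dst> now holds a copy of v<src>."""
    # Overwriting dst invalidates every register recorded as a copy of dst.
    for reg in [r for r, s in copy_of.items() if s == dst]:
        del copy_of[reg]
    copy_of[dst] = src


def combine_movs(movs: List[Mov]) -> List[str]:
    """Fuse adjacent 32-bit copies from one source into 64-bit copies."""
    out: List[str] = []
    copy_of: Dict[int, int] = {}
    i = 0
    while i < len(movs):
        dst, src = movs[i]
        if i + 1 < len(movs):
            dst2, src2 = movs[i + 1]
            fusable = (
                dst % 2 == 0                      # assumed: even-aligned destination pair
                and dst2 == dst + 1               # destinations are consecutive
                and src == src2                   # both halves copy the same register
                and src % 2 == 0                  # assumed: even-aligned source pair
                and copy_of.get(src + 1) == src   # v[src+1] is known to equal v[src]
                and not {dst, dst2} & {src, src + 1}  # pairs do not overlap
            )
            if fusable:
                out.append(f"v_mov_b64_e32 v[{dst}:{dst2}], v[{src}:{src + 1}]")
                _record_copy(copy_of, dst, src)
                _record_copy(copy_of, dst2, src)
                i += 2
                continue
        out.append(f"v_mov_b32_e32 v{dst}, v{src}")
        _record_copy(copy_of, dst, src)
        i += 1
    return out


if __name__ == "__main__":
    # v1..v5 are all copies of v0, as in the assembly above.
    for line in combine_movs([(1, 0), (2, 0), (3, 0), (4, 0), (5, 0)]):
        print(line)
```

In this toy model the first copy into v1 stays a 32-bit move, and the following copies are paired up. A real implementation would also have to account for other instructions that read or write the registers involved, which this sketch ignores.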
I wonder where such an optimization might live. Would it be a standalone pass like `GCNDPPCombine.cpp`, or should it be a pattern in `AMDGPUPostLegalizerCombiner.cpp`? With some guidance on this, I'd be happy to give it a try.
[Please let me know if this task doesn't make sense; I'm quite new here and would like to learn.]
</pre>