<table border="1" cellspacing="0" cellpadding="8">
    <tr>
        <th>Issue</th>
        <td>
            <a href=https://github.com/llvm/llvm-project/issues/139198>139198</a>
        </td>
    </tr>

    <tr>
        <th>Summary</th>
        <td>
            AMDGPU missed opportunity: 2 x v_mov_b32 -> v_mov_b64
        </td>
    </tr>

    <tr>
      <th>Labels</th>
      <td>
            new issue
      </td>
    </tr>

    <tr>
      <th>Assignees</th>
      <td>
      </td>
    </tr>

    <tr>
      <th>Reporter</th>
      <td>
          newling
      </td>
    </tr>
</table>

<pre>
    Consider the following IR (input.ll). In particular, I'm focusing on what `zeroinitializer` gets lowered to by llc.

```
define amdgpu_kernel void @main(ptr addrspace(1) %out_ptr) {
entry:
  br label %loop

loop: ; preds = %loop, %entry
  %vec = phi <32 x float> [ zeroinitializer, %entry ], [ %vec.next, %loop ]
  %i = phi i32 [ 0, %entry ], [ %i.next, %loop ]
  %element = extractelement <32 x float> %vec, i32 %i
  %add = fadd float %element, 1.0
  %vec.next = insertelement <32 x float> %vec, float %add, i32 %i
  %i.next = add nuw nsw i32 %i, 1
  %exitcond = icmp eq i32 %i.next, 16
  br i1 %exitcond, label %store_result, label %loop

store_result:                                     ; preds = %loop
  %ptr = getelementptr float, ptr addrspace(1) %out_ptr, i64 0
  store <32 x float> %vec.next, ptr addrspace(1) %ptr, align 64
  ret void
}
```

Running 
```
llc -mtriple=amdgcn -mcpu=gfx942 input.ll
```

the generated assembly is:

```
        .text
        .globl  main ; -- Begin function main
        .p2align        8
        .type   main,@function
main: ; @main
; %bb.0:                                ; %entry
        v_mov_b32_e32 v0, 0
        s_mov_b32 s0, 0
        v_mov_b32_e32 v1, v0
        v_mov_b32_e32 v2, v0
        v_mov_b32_e32 v3, v0
        v_mov_b32_e32 v4, v0
        v_mov_b32_e32 v5, v0
[...]
        v_mov_b32_e32 v26, v0
        v_mov_b32_e32 v27, v0
        v_mov_b32_e32 v28, v0
        v_mov_b32_e32 v29, v0
        v_mov_b32_e32 v30, v0
        v_mov_b32_e32 v31, v0
.LBB0_1:                                ; %loop
                                        ; =>This Inner Loop Header: Depth=1
        s_set_gpr_idx_on s0, gpr_idx(SRC0)
        v_mov_b32_e32 v32, v0
        s_set_gpr_idx_off
        v_add_f32_e32 v32, 1.0, v32
        s_set_gpr_idx_on s0, gpr_idx(DST)
```

The optimization I have in mind is to combine pairs of consecutive `v_mov_b32_e32` instructions into single 64-bit moves, to arrive at something like

```
[...]
        v_mov_b64_e32 v[2:3], v[0:1]
        v_mov_b64_e32 v[4:5], v[0:1]
        v_mov_b64_e32 v[6:7], v[0:1]
[...]
        v_mov_b64_e32 v[30:31], v[0:1]
```

making use of the two-register move instruction on MI300 (search for "V_MOV_B64" in https://www.amd.com/content/dam/amd/en/documents/instinct-tech-docs/instruction-set-architectures/amd-instinct-mi300-cdna3-instruction-set-architecture.pdf ).
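
To make the pairing rule concrete, here is a minimal text-level sketch in Python. It is only a toy model of the idea, not LLVM code and not a proposed implementation: the helper names `parse_mov` and `combine_movs` are hypothetical, and the requirement that both register pairs be even-aligned is an assumption about the target that should be checked against the ISA document. Two adjacent copies are fused only when the odd half of the source pair is already known to hold the same value as the even half, which is what makes `v[0:1]` a valid 64-bit source in the example above.

```python
import re

# Matches a 32-bit VGPR-to-VGPR copy such as "v_mov_b32_e32 v2, v0".
MOV32 = re.compile(r"^\s*v_mov_b32_e32 v(\d+), v(\d+)\s*$")


def parse_mov(line):
    """Return (dst, src) register numbers for a 32-bit VGPR copy, else None."""
    m = MOV32.match(line)
    return (int(m.group(1)), int(m.group(2))) if m else None


def combine_movs(lines):
    """Toy peephole: fuse two adjacent copies from one source into a 64-bit copy."""
    out = []
    copies = set()  # (dst, src) pairs: dst is currently known to equal src
    i = 0
    while i != len(lines):
        cur = parse_mov(lines[i])
        nxt = parse_mov(lines[i + 1]) if i + 1 != len(lines) else None
        if cur is None:
            # Unmodelled instruction: conservatively forget every known copy.
            copies.clear()
            out.append(lines[i])
            i += 1
            continue
        dst, src = cur
        fusable = (
            nxt == (dst + 1, src)         # same source, consecutive destinations
            and dst % 2 == 0              # assumption: 64-bit tuples are even-aligned
            and src % 2 == 0
            and (src + 1, src) in copies  # v[src+1] already holds the same value
        )
        written = [dst, dst + 1] if fusable else [dst]
        # Drop facts that mention a register about to be overwritten.
        copies = {(d, s) for (d, s) in copies if d not in written and s not in written}
        if fusable:
            out.append(f"        v_mov_b64_e32 v[{dst}:{dst + 1}], v[{src}:{src + 1}]")
        else:
            out.append(lines[i])
        # Record that every register just written now equals the source.
        copies.update((d, src) for d in written if d != src)
        i += len(written)
    return out
```

Feeding it the first few lines of the assembly above (up to `v_mov_b32_e32 v5, v0`) leaves the `v0`, `s0`, and `v1` initializations untouched and rewrites the remaining four copies into two `v_mov_b64_e32` instructions. A real implementation would work on MachineInstrs and would have to account for liveness, register classes, and subtarget features rather than matching text.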

I wonder where such an optimization might live. Would it be a standalone pass like `GCNDPPCombine.cpp`, or should it be a pattern in `AMDGPUPostLegalizerCombiner.cpp`? If I can have some guidance on this, I'd be happy to give it a try.

[Please let me know if this task doesn't make sense. I'm quite new here and would like to learn.]
</pre>