<table border="1" cellspacing="0" cellpadding="8">
    <tr>
        <th>Issue</th>
        <td>
            <a href=https://github.com/llvm/llvm-project/issues/100371>100371</a>
        </td>
    </tr>

    <tr>
        <th>Summary</th>
        <td>
            [AArch64] The `umlal` instruction that cannot be executed in parallel?
        </td>
    </tr>

    <tr>
      <th>Labels</th>
      <td>
            backend:AArch64
      </td>
    </tr>

    <tr>
      <th>Assignees</th>
      <td>
      </td>
    </tr>

    <tr>
      <th>Reporter</th>
      <td>
          DianQK
      </td>
    </tr>
</table>

<pre>
    The `reassociate` pass changes the instruction order of the following IR:

```llvm
target datalayout = "e-m:o-i64:64-i128:128-n32:64-S128"
target triple = "arm64-apple-macosx11.0.0"

define <2 x i64> @src(ptr %arg, ptr %arg1, i64 noundef %arg2, <2 x i64> %arg3, <2 x i64> %arg4, <2 x i64> %arg5, <4 x i32> %arg6, <4 x i32> %arg7) {
bb:
  %i = shufflevector <4 x i32> %arg6, <4 x i32> poison, <2 x i32> <i32 0, i32 1>
  %i8 = shufflevector <4 x i32> %arg7, <4 x i32> poison, <2 x i32> <i32 0, i32 1>
  %i9 = tail call <2 x i64> @llvm.aarch64.neon.umull.v2i64(<2 x i32> %i, <2 x i32> %i8)
  %i10 = add <2 x i64> %i9, %arg5
  %i11 = shufflevector <4 x i32> %arg6, <4 x i32> poison, <2 x i32> <i32 2, i32 3>
  %i12 = shufflevector <4 x i32> %arg7, <4 x i32> poison, <2 x i32> <i32 2, i32 3>
  %i13 = tail call <2 x i64> @llvm.aarch64.neon.umull.v2i64(<2 x i32> %i11, <2 x i32> %i12)
  %i14 = add <2 x i64> %i13, %arg5
  ; tail call void asm sideeffect alignstack "/* ${0} */", "w,~{cc},~{memory}"(<2 x i64> %i10)
  ; tail call void asm sideeffect alignstack "/* ${0} */", "w,~{cc},~{memory}"(<2 x i64> %i14)
  %i15 = add <2 x i64> %i10, %arg3
  %i16 = add <2 x i64> %i14, %arg4
  %i17 = mul <2 x i64> %i15, %i16
  ret <2 x i64> %i17
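}

; Declaration of the intrinsic used above, for parsers that do not declare it implicitly.
declare <2 x i64> @llvm.aarch64.neon.umull.v2i64(<2 x i32>, <2 x i32>)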
```

The changes in the assembly instructions are as follows:
```
; origin
        umlal2.2d       v2, v3, v4
        umlal.2d        v5, v3, v4
        add.2d  v1, v2, v1
        add.2d  v0, v5, v0
; after reassociate
        add.2d  v0, v2, v0
        add.2d  v1, v2, v1
        umlal.2d        v0, v3, v4
        umlal2.2d       v1, v3, v4
```
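
Both orders compute the same values; only the dependency direction between `add` and `umlal` changes. The following is a minimal scalar sketch in Python, not part of the original reproducer: the helper names are made up for illustration, and each vector is modeled as a list of lanes with wrapping 64-bit arithmetic. It says nothing about timing on real hardware.

```python
MASK64 = (1 << 64) - 1  # wrap results to 64 bits, like <2 x i64> lanes


def umull(a, b):
    """Model of a widening multiply: 32-bit lanes in, 64-bit lanes out."""
    return [(x * y) & MASK64 for x, y in zip(a, b)]


def add(a, b):
    """Model of add.2d: lane-wise wrapping 64-bit addition."""
    return [(x + y) & MASK64 for x, y in zip(a, b)]


def original_order(arg3, arg4, arg5, arg6, arg7):
    # Multiply-accumulate onto %arg5 first (umlal / umlal2) ...
    i10 = add(umull(arg6[:2], arg7[:2]), arg5)
    i14 = add(umull(arg6[2:], arg7[2:]), arg5)
    # ... then the plain adds consume the results.
    i15 = add(i10, arg3)
    i16 = add(i14, arg4)
    return i15, i16


def reassociated_order(arg3, arg4, arg5, arg6, arg7):
    # Plain adds first ...
    t0 = add(arg5, arg3)
    t1 = add(arg5, arg4)
    # ... then each multiply-accumulate depends on an add result.
    i15 = add(t0, umull(arg6[:2], arg7[:2]))
    i16 = add(t1, umull(arg6[2:], arg7[2:]))
    return i15, i16


if __name__ == "__main__":
    args = ([MASK64, 1], [5, MASK64 - 2], [10, 20],
            [0xFFFFFFFF, 2, 3, 4], [0xFFFFFFFF, 5, 6, 7])
    print(original_order(*args) == reassociated_order(*args))
```

In the original order the accumulator input of each `umlal` is a function argument, while in the reassociated order it is the output of an `add`.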

The altered instruction order performs significantly worse on the Apple M1. (I am not sure whether this also applies to other ARM processors.)
My tentative guess is that the `add` instructions are preventing the `umlal` instructions from executing in parallel. Perhaps we need an `llvm.aarch64.neon.umlal.*` intrinsic?

Here's a real example in Rust: https://rust-lang.zulipchat.com/#narrow/stream/187780-t-compiler.2Fwg-llvm/topic/Is.20instruction.20ordering.20something.20to.20file.20issues.20about.3F/near/453056084
C: https://github.com/Cyan4973/xxHash/blob/a57f6cce2698049863af8c25787084ae0489d849/xxhash.h#L5312-L5323
Godbolt: https://llvm.godbolt.org/z/oeKqn19ff
</pre>