<table border="1" cellspacing="0" cellpadding="8">
<tr>
<th>Issue</th>
<td>
<a href=https://github.com/llvm/llvm-project/issues/100371>100371</a>
</td>
</tr>
<tr>
<th>Summary</th>
<td>
[AArch64] The `umlal` instruction that cannot be executed in parallel?
</td>
</tr>
<tr>
<th>Labels</th>
<td>
backend:AArch64
</td>
</tr>
<tr>
<th>Assignees</th>
<td>
</td>
</tr>
<tr>
<th>Reporter</th>
<td>
DianQK
</td>
</tr>
</table>
<pre>
The `reassociate` pass changes the instruction order of the following IR:
```llvm
target datalayout = "e-m:o-i64:64-i48:128-n32:64-S128"
target triple = "arm64-apple-macosx11.0.0"
define <2 x i64> @src(ptr %arg, ptr %arg1, i64 noundef %arg2, <2 x i64> %arg3, <2 x i64> %arg4, <2 x i64> %arg5, <4 x i32> %arg6, <4 x i32> %arg7) {
bb:
%i = shufflevector <4 x i32> %arg6, <4 x i32> poison, <2 x i32> <i32 0, i32 1>
%i8 = shufflevector <4 x i32> %arg7, <4 x i32> poison, <2 x i32> <i32 0, i32 1>
%i9 = tail call <2 x i64> @llvm.aarch64.neon.umull.v2i64(<2 x i32> %i, <2 x i32> %i8)
%i10 = add <2 x i64> %i9, %arg5
%i11 = shufflevector <4 x i32> %arg6, <4 x i32> poison, <2 x i32> <i32 2, i32 3>
%i12 = shufflevector <4 x i32> %arg7, <4 x i32> poison, <2 x i32> <i32 2, i32 3>
%i13 = tail call <2 x i64> @llvm.aarch64.neon.umull.v2i64(<2 x i32> %i11, <2 x i32> %i12)
%i14 = add <2 x i64> %i13, %arg5
; tail call void asm sideeffect alignstack "/* ${0} */", "w,~{cc},~{memory}"(<2 x i64> %i10)
; tail call void asm sideeffect alignstack "/* ${0} */", "w,~{cc},~{memory}"(<2 x i64> %i14)
%i15 = add <2 x i64> %i10, %arg3
%i16 = add <2 x i64> %i14, %arg4
%i17 = mul <2 x i64> %i15, %i16
ret <2 x i64> %i17
}
```
The generated assembly changes as follows:
```
; original
umlal2.2d v2, v3, v4
umlal.2d v5, v3, v4
add.2d v1, v2, v1
add.2d v0, v5, v0
; after reassociate
add.2d v0, v2, v0
add.2d v1, v2, v1
umlal.2d v0, v3, v4
umlal2.2d v1, v3, v4
```
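To make the difference between the two sequences concrete, here is a minimal per-lane scalar sketch in C. The function names (`umull_lane`, `lane_original`, `lane_reassociated`) are hypothetical and only model one 64-bit lane; they are not LLVM or ACLE APIs. The sketch assumes wrapping 64-bit arithmetic, which is what makes the two association orders produce the same value. It only illustrates the data dependences and says nothing about how any particular core schedules the instructions.

```c
#include <stdint.h>

/* Hypothetical scalar model of one 64-bit lane of UMULL:
   a widening 32x32 -> 64 unsigned multiply. */
static uint64_t umull_lane(uint32_t a, uint32_t b) {
    return (uint64_t)a * (uint64_t)b;
}

/* Original association: (umull(a, b) + acc) + addend.
   The multiply-accumulate reads `acc` directly, so it does not
   depend on the result of any add. */
static uint64_t lane_original(uint32_t a, uint32_t b,
                              uint64_t acc, uint64_t addend) {
    uint64_t mla = umull_lane(a, b) + acc;
    return mla + addend;
}

/* Reassociated form: umull(a, b) + (acc + addend).
   The multiply-accumulate now takes the result of the add
   as its accumulator input. */
static uint64_t lane_reassociated(uint32_t a, uint32_t b,
                                  uint64_t acc, uint64_t addend) {
    uint64_t sum = acc + addend;
    return umull_lane(a, b) + sum;
}
```

Both functions return the same value for every input, since unsigned addition is associative modulo 2^64. The only difference is which operation feeds the accumulator of the multiply-accumulate.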
The reordered sequence is significantly slower on the Apple M1. (I am not sure whether other ARM processors are affected as well.)
My tentative guess is that the preceding `add` instructions prevent the `umlal` instructions from executing in parallel. Perhaps we need an `llvm.aarch64.neon.umlal.*` intrinsic?
Here's a real example in Rust: https://rust-lang.zulipchat.com/#narrow/stream/187780-t-compiler.2Fwg-llvm/topic/Is.20instruction.20ordering.20something.20to.20file.20issues.20about.3F/near/453056084
C: https://github.com/Cyan4973/xxHash/blob/a57f6cce2698049863af8c25787084ae0489d849/xxhash.h#L5312-L5323
Godbolt: https://llvm.godbolt.org/z/oeKqn19ff
</pre>