<table border="1" cellspacing="0" cellpadding="8">
<tr>
<th>Issue</th>
<td>
<a href=https://github.com/llvm/llvm-project/issues/194150>194150</a>
</td>
</tr>
<tr>
<th>Summary</th>
<td>
Question about canonicalizing complex multiply to an addsub-friendly vector form
</td>
</tr>
<tr>
<th>Labels</th>
<td>
new issue
</td>
</tr>
<tr>
<th>Assignees</th>
<td>
</td>
</tr>
<tr>
<th>Reporter</th>
<td>
ParkHanbum
</td>
</tr>
</table>
<pre>
I am still new to MLIR/LLVM vector lowering, so this may be a naive question.
I looked at this issue from the canonicalization/codegen side, and I wonder whether the missing piece is an earlier canonicalization that exposes an addsub-friendly vector form, rather than the final X86 backend combine.
For one complex multiply:
```text
real = a0*b0 - a1*b1
imag = a0*b1 + a1*b0
```
Would it make sense to canonicalize this into a form like this?
```mlir
func.func @cmul_vec(%a: vector<2xf64>, %b: vector<2xf64>) -> vector<2xf64> {
  %a0 = vector.shuffle %a, %a [0, 0]
    : vector<2xf64>, vector<2xf64>
  %a1 = vector.shuffle %a, %a [1, 1]
    : vector<2xf64>, vector<2xf64>
  %bs = vector.shuffle %b, %b [1, 0]
    : vector<2xf64>, vector<2xf64>
  %sa = arith.mulf %a0, %b : vector<2xf64>
  %sb = arith.mulf %a1, %bs : vector<2xf64>
  %sub = arith.subf %sa, %sb : vector<2xf64>
  %add = arith.addf %sa, %sb : vector<2xf64>
  %res = vector.shuffle %sub, %add [0, 3]
    : vector<2xf64>, vector<2xf64>
  return %res : vector<2xf64>
}
```
The corresponding LLVM IR shape would be:
```llvm
%a0 = shufflevector <2 x double> %a, <2 x double> poison,
<2 x i32> zeroinitializer
%a1 = shufflevector <2 x double> %a, <2 x double> poison,
<2 x i32> <i32 1, i32 1>
%bs = shufflevector <2 x double> %b, <2 x double> poison,
<2 x i32> <i32 1, i32 0>
%sa = fmul <2 x double> %a0, %b
%sb = fmul <2 x double> %a1, %bs
%sub = fsub <2 x double> %sa, %sb
%add = fadd <2 x double> %sa, %sb
%res = shufflevector <2 x double> %sub, <2 x double> %add,
<2 x i32> <i32 0, i32 3>
```
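As a sanity check on the dataflow, here is a minimal Python sketch that models the two-input shuffles on plain lists and compares the result with Python's built-in complex multiply. The helper names `shuffle` and `cmul_vec` are hypothetical and exist only for this illustration; the sketch models the lane arithmetic only, not LLVM's poison or fast-math semantics.

```python
import math

def shuffle(v1, v2, mask):
    """Model a two-input shuffle: indices select from the concatenation v1 + v2."""
    both = list(v1) + list(v2)
    return [both[i] for i in mask]

def cmul_vec(a, b):
    """a and b are [real, imag]; follows the shuffle/mul/sub/add sequence step by step."""
    a0 = shuffle(a, a, [0, 0])            # [a0, a0]
    a1 = shuffle(a, a, [1, 1])            # [a1, a1]
    bs = shuffle(b, b, [1, 0])            # [b1, b0]
    sa = [x * y for x, y in zip(a0, b)]   # [a0*b0, a0*b1]
    sb = [x * y for x, y in zip(a1, bs)]  # [a1*b1, a1*b0]
    sub = [x - y for x, y in zip(sa, sb)]
    add = [x + y for x, y in zip(sa, sb)]
    return shuffle(sub, add, [0, 3])      # lane 0 from sub, lane 1 from add

a, b = [1.5, -2.0], [0.25, 4.0]
got = cmul_vec(a, b)
want = complex(*a) * complex(*b)
assert math.isclose(got[0], want.real) and math.isclose(got[1], want.imag)
```

Lane 0 of the result is `a0*b0 - a1*b1` and lane 1 is `a0*b1 + a1*b0`, which matches the real/imag formulas at the top.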
My understanding is that this is the shape that can expose the existing X86 addsub/fmaddsub lowering opportunities, instead of trying to recover the idiom much later from scalar extract/insert code.
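For reference, SSE3 `ADDSUBPD` subtracts in the even (low) lane and adds in the odd (high) lane. The sketch below is a scalar Python model of that lane behaviour next to the fsub/fadd/shuffle pattern with mask [0, 3]. The function names are hypothetical, and the sketch says nothing about which patterns the X86 backend actually matches; it only shows that the two computations agree lane by lane.

```python
def addsubpd(x, y):
    """Scalar model of ADDSUBPD on 2 lanes: lane 0 subtracts, lane 1 adds."""
    return [x[0] - y[0], x[1] + y[1]]

def blend_sub_add(x, y):
    """Full-width sub and add, then take lane 0 of sub and lane 1 of add (mask [0, 3])."""
    sub = [x[0] - y[0], x[1] - y[1]]
    add = [x[0] + y[0], x[1] + y[1]]
    both = sub + add
    return [both[0], both[3]]

sa, sb = [0.375, 6.0], [-8.0, -0.5]
assert addsubpd(sa, sb) == blend_sub_add(sa, sb)
```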
I also checked whether this form causes obviously worse codegen on GPU backends.
For NVPTX, the vector form scalarizes cleanly. The generated PTX was:
```text
; custom_vector
ld.param.v2.b64 {%rd1, %rd2}, [cmul_vec_param_0];
ld.param.v2.b64 {%rd3, %rd4}, [cmul_vec_param_1];
mul.rn.f64 %rd5, %rd1, %rd4;
mul.rn.f64 %rd6, %rd1, %rd3;
mul.rn.f64 %rd7, %rd2, %rd3;
mul.rn.f64 %rd8, %rd2, %rd4;
sub.rn.f64 %rd9, %rd6, %rd8;
add.rn.f64 %rd10, %rd5, %rd7;
st.param.v2.b64 [func_retval0], {%rd9, %rd10};
ret;
```
The existing complex.mul lowering produced the same arithmetic shape, but with scalar parameter loads/stores:
```text
; cmul_complex
ld.param.b64 %rd1, [cmul_complex_param_0+8];
ld.param.b64 %rd2, [cmul_complex_param_0];
ld.param.b64 %rd3, [cmul_complex_param_1+8];
ld.param.b64 %rd4, [cmul_complex_param_1];
mul.rn.f64 %rd5, %rd2, %rd4;
mul.rn.f64 %rd6, %rd1, %rd3;
sub.rn.f64 %rd7, %rd5, %rd6;
mul.rn.f64 %rd8, %rd1, %rd4;
mul.rn.f64 %rd9, %rd2, %rd3;
add.rn.f64 %rd10, %rd8, %rd9;
st.param.b64 [func_retval0+8], %rd10;
st.param.b64 [func_retval0], %rd7;
ret;
```
For AMDGPU/gfx90a, the vector form also scalarized without leaving explicit shuffle/permute instructions:
```asm
; custom_vector
s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
v_mul_f64 v[8:9], v[0:1], v[6:7]
v_mul_f64 v[0:1], v[0:1], v[4:5]
v_mul_f64 v[4:5], v[2:3], v[4:5]
v_mul_f64 v[2:3], v[2:3], v[6:7]
v_add_f64 v[0:1], v[0:1], -v[2:3]
v_add_f64 v[2:3], v[8:9], v[4:5]
s_setpc_b64 s[30:31]
```
```asm
; cmul_complex
s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
v_mul_f64 v[8:9], v[0:1], v[4:5]
v_mul_f64 v[10:11], v[2:3], v[6:7]
v_add_f64 v[8:9], v[8:9], -v[10:11]
v_mul_f64 v[2:3], v[2:3], v[4:5]
v_mul_f64 v[0:1], v[0:1], v[6:7]
v_add_f64 v[2:3], v[2:3], v[0:1]
v_mov_b32_e32 v0, v8
v_mov_b32_e32 v1, v9
s_setpc_b64 s[30:31]
```
I also ran llvm-mca on the extracted gfx90a instruction blocks. This is not a benchmark, but I thought it was useful as a static codegen check:
```text
custom vector form:
Iterations: 100
Instructions: 800
Total Cycles: 808
Total uOps: 800
Dispatch Width: 1
uOps Per Cycle: 0.99
IPC: 0.99
Block RThroughput: 8.0
existing complex.mul form:
Iterations: 100
Instructions: 1000
Total Cycles: 1008
Total uOps: 1000
Dispatch Width: 1
uOps Per Cycle: 0.99
IPC: 0.99
Block RThroughput: 10.0
```
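The summary numbers are related by simple arithmetic, sketched below in Python using only the figures reported above (the dictionary layout and helper names are mine). The two-instruction difference per iteration corresponds to the two `v_mov_b32_e32` copies that appear only in the `cmul_complex` listing.

```python
runs = {
    "custom vector form": {"iterations": 100, "instructions": 800, "cycles": 808},
    "existing complex.mul form": {"iterations": 100, "instructions": 1000, "cycles": 1008},
}

def per_iteration(r):
    """Instructions in one copy of the extracted block."""
    return r["instructions"] / r["iterations"]

def ipc(r):
    """Instructions per cycle over the whole simulated run."""
    return r["instructions"] / r["cycles"]

for name, r in runs.items():
    print(f"{name}: {per_iteration(r):.0f} instructions/iteration, IPC {ipc(r):.2f}")
```

Since the reported dispatch width is 1, the cycle count here tracks the instruction count almost exactly, so this mostly confirms the instruction-count difference rather than measuring real throughput.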
So my current understanding is:
* the main motivation would still be exposing the X86 addsub/fmaddsub-friendly shape;
* the vector form does not seem to leave extra shuffle instructions in the small NVPTX/AMDGPU cases I checked;
* the generated GPU code in these small cases still maps back to the expected scalar complex multiply arithmetic.
Given this, would it be reasonable to add an MLIR-side canonicalization that lowers this complex multiply idiom to the addsub-friendly vector form shown above?
</pre>