<table border="1" cellspacing="0" cellpadding="8">
<tr>
<th>Issue</th>
<td>
<a href=https://github.com/llvm/llvm-project/issues/144227>144227</a>
</td>
</tr>
<tr>
<th>Summary</th>
<td>
[SelectionDAG][x86] Shuffle pyramid is not eliminated
</td>
</tr>
<tr>
<th>Labels</th>
<td>
new issue
</td>
</tr>
<tr>
<th>Assignees</th>
<td>
</td>
</tr>
<tr>
<th>Reporter</th>
<td>
sutajo
</td>
</tr>
</table>
<pre>
https://godbolt.org/z/zEonMr7da
The above example illustrates the problem.
There are 2 minimum reductions in this snippet, but only the second is compiled into the efficient form:
```
vextracti128 xmm1, ymm0, 1
vpminuw xmm0, xmm0, xmm1
vphminposuw xmm0, xmm0
```
The first one is computed with shuffles and mins:
```
vextracti128 xmm2, ymm0, 1
vpminuw xmm2, xmm0, xmm2
vpshufd xmm3, xmm2, 238 # xmm3 = xmm2[2,3,2,3]
vpminuw xmm3, xmm2, xmm3
vpshufd xmm4, xmm3, 85 # xmm4 = xmm3[1,1,1,1]
vpminuw xmm3, xmm3, xmm4
vpsrld xmm4, xmm3, 16
vphminposuw xmm2, xmm2
```
Normally, `llvm.vector.reduce.umin.v16i16` is converted into a series of vector_shuffle and umin operations in the SelectionDAG.
```
Initial selection DAG: %bb.0 'test_reduce_v16i16_with_sharing:start'
SelectionDAG has 47 nodes:
t0: ch,glue = EntryToken
t2: i64,ch = CopyFromReg t0, Register:i64 %0
t4: v16i16,ch = load<(load (s256))> t0, t2, undef:i64
t9: v16i16 = vector_shuffle<8,9,10,11,12,13,14,15,u,u,u,u,u,u,u,u> t4, poison:v16i16
t10: v16i16 = umin t4, t9
t11: v16i16 = vector_shuffle<4,5,6,7,u,u,u,u,u,u,u,u,u,u,u,u> t10, poison:v16i16
t12: v16i16 = umin t10, t11
t13: v16i16 = vector_shuffle<2,3,u,u,u,u,u,u,u,u,u,u,u,u,u,u> t12, poison:v16i16
t14: v16i16 = umin t12, t13
t17: i32 = Constant<0>
t15: v16i16 = vector_shuffle<1,u,u,u,u,u,u,u,u,u,u,u,u,u,u,u> t14, poison:v16i16
t16: v16i16 = umin t14, t15
t19: i16 = extract_vector_elt t16, Constant:i64<0>
t21: v1i16 = insert_vector_elt poison:v1i16, t19, Constant:i64<0>
t22: v16i16 = concat_vectors t21, t21, t21, t21, t21, t21, t21, t21, t21, t21, t21, t21, t21, t21, t21, t21
t24: v16i1 = setcc t4, t22, seteq:ch
t6: i64,ch = CopyFromReg t0, Register:i64 %1
t7: v16i16,ch = load<(load (s256))> t0, t6, undef:i64
t26: v16i16 = BUILD_VECTOR Constant:i16<-1>, Constant:i16<-1>, Constant:i16<-1>, Constant:i16<-1>, Constant:i16<-1>, Constant:i16<-1>, Constant:i16<-1>, Constant:i16<-1>, Constant:i16<-1>, Constant:i16<-1>, Constant:i16<-1>, Constant:i16<-1>, Constant:i16<-1>, Constant:i16<-1>, Constant:i16<-1>, Constant:i16<-1>
t27: v16i16 = vselect t24, t7, t26
t28: v16i16 = vector_shuffle<8,9,10,11,12,13,14,15,u,u,u,u,u,u,u,u> t27, poison:v16i16
t29: v16i16 = umin t27, t28
t30: v16i16 = vector_shuffle<4,5,6,7,u,u,u,u,u,u,u,u,u,u,u,u> t29, poison:v16i16
t31: v16i16 = umin t29, t30
t32: v16i16 = vector_shuffle<2,3,u,u,u,u,u,u,u,u,u,u,u,u,u,u> t31, poison:v16i16
t33: v16i16 = umin t31, t32
t38: i16,i16 = merge_values undef:i16, undef:i16
t39: i16,i16 = merge_values t19, undef:i16
t34: v16i16 = vector_shuffle<1,u,u,u,u,u,u,u,u,u,u,u,u,u,u,u> t33, poison:v16i16
t35: v16i16 = umin t33, t34
t36: i16 = extract_vector_elt t35, Constant:i64<0>
t40: i16,i16 = merge_values t39, t36
t43: ch,glue = CopyToReg t0, Register:i16 $ax, t40
t45: ch,glue = CopyToReg t43, Register:i16 $dx, t40:1, t43:1
t46: ch = X86ISD::RET_GLUE t45, TargetConstant:i32<0>, Register:i16 $ax, Register:i16 $dx, t45:1
```
In the x86 backend, the `combineExtractVectorElt` function is supposed detect binary operation reductions and replace these with a more efficient implementation. The problem is that `combineExtractVectorElt` only replaces the result of `extract_vector_elt` (in this case t19 and 36), but it does not replace t16 and t35.
During the combination steps, t22 is transformed to `vector_shuffle<0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0> t16, undef:v16i16`, so it depends on t16, and thus it keeps the reduction chain alive.
A possbile solution to this would be:
Let's say `N = extract_vector_elt(V,0)` and `V` comes from a reduction.
If `N` gets replaced with `M`, we could replace `V` with `scalar_to_vector(M)`.
This would eliminate all references to the inefficient reduction.
This is safe to do, since:
- All elements of V expect from the zeroth are known to be undefined
- M does not depend on V, so there will be no cycles.
</pre>
<img width="1" height="1" alt="" src="http://email.email.llvm.org/o/eJzsWE9v2zoS_zTMhaghkZIsH3Jw7aQo0HaBNi_YW0BLY5uvFKklqSR-n34xJOV_jWtsF3172SCGEnM485uZH2fEEc7JjQa4JeV7Ui5vxOC3xt66wYs_zc3KtLvbrfe9I3xO2D1h9xvTrozyE2M3hN3_hZ87oz_baStINifZ_GELVKzMM1B4FV2vgEqlBuet8OCo3wLtrVkp6CZ7eQtUWKCMdlLLbuiohXZovDTaUamp30pHnZZ9D56wBV0NnhqtdkGZg8bolkpHG9P1UkFLpfYmrMF6LRsJ2tO1sR26kM1JlaXfbP4Mr96Kxsuc1fS163LUvuu6DJ85CvSd1MMLroXvjp5xedtJ3Rt3LnJqJgVlLa1D4DCCHTy09EX6LXXbYb1W4KjQLQbBXcFKKdphP4HLzuCysIx2WvyX779mC8p40HjyQxgPcpTwZRQs36MwbozPcnlq8URl-P_EYrH_mi1oXf5gcLRYjBY5Kd9jQg6fSxbHZxEtWtXSHyzm1Vm-jmOY4nOWsy_GdkKpHUqQKlPquZs8Q-ONnQR-wmTopJ4855XMK1JlMa36GawfSSioAyvBUbOmcetTSnXINO6npgcrjrgO9BsoCOxfzj-kQ3IM7aOWXgpF3ShGl_MPhM8pYeVqNckoYVMPzj9FlE8R4BMS7clthZV6Q_jceWE9YVOSzY_t0a1wtJhSbVpILKTUZ6i-2RK22KgBQobutLe7B_MddBCh1DMUkhWGvdkGmYXpd_fWdF9hgzrYgn6FjXQeLOFzWRWIOIsWCtycQrnfr4xoCV8QVuNflLDasbIibIa__C7p9CGHg25hHdWOgGYHnUHdaQYIX9SELWZILdSTB5KhrhwZk6MfeUnYYvjpB2EEpvVGOqMJnycvsjn1eXYGIWQ8yvvZiDPPrwHFDQgFYzO9CuktiMHFtzBSn7M3QcYdCG6Eya_BHAvEfwbvBCa7DLN4GyaLMHmSmgYWcpYIqJ0X2hO-yAi_S56gWHnNl_yX_dj7cokWAUD1tjeRGnmZvAkUHkVSB3hKWEH5oIctjvwM_D_y1rNErlGJ1A7siY4jhPH0Bbs_1RrP-zlvGqMbMap2wXQ4nn_HY6xBB5IESA5804wHjgWqOPDwL8LnzfbgSvVLpStY9dP_onJVFyoXO2fH-z8-flo-Pd4tHv7x9SQvyKPFuxwTc5ax_6_8T1bioZuel5fYrANBMfHTSMh9OWD139Wr2PRSs2Ln_TKWJJaw1iNWft7UfkO7YrOLfYCfd8sEM-xAcCPM8_L0G9oVzy_DPO-WEWbcgeCiVJ0KPGGLUbADu4GnZ6EGcIfikJ_WipQzPruyP5Xys33px_Pznvob2iDnlwjHz5twChGPIdpXQ15d64K8vNIFi-xanHjiT0pfwX985cWe8GDe7giokxXiNSgpxpfa8qdKCv62lnavhc8jXQKaPCmtotKg7Z919fHbEt_T-fzr3cPTh09_3AW7bEEfhN2APw4KHokYlJ-gv4ynTCDO7kkf463lta7oSjTfQbdBfAt4a2pMt5Ia7mLSHkPO7pTH69J60PH6grf7oe-Ng5a24LFOrqQWdne4Gx1PBPDmZKFXogG04iDeoQXtjD2-88uuV9CB9kHDhD4cRg9o0m-Fv4IwDBmSqTi6sOAG5fE6R6rsRybiJsLqcWbRCAd4AgNkHtt_HF9IT1sDjmrjD77kVRD0vEy3vuWAl7VgOKKMsXAeepfeaIInVmi3NraDlnqDyH44xsjXX_-Et9mT-rO_9IY3KhP8gR5066jRo3BwZjs4XP0O0I8hTJmkzVZITYWSz5AcntPeOLeSCqgzaghSYZYjHX0xg2rpCuKl9BPg5dVRJ3bo8JcLpYGw-jG6McPcICJSZY_4d2M6cHRtTUfFARQC-Riy-wWFNuDdmKA0qyFV9jk5_oJ5QVRjCkfdo6BrhBL2yZsEibD6c4QyCUOhvVugZIfpBSqUohbWYEEH0sVJltQHXp9gDTrwAIk1oHBrQkakblKg3tG5UhTiSQiDiEcKrz0esuA7av8LrMETZIF-1-YlxHwFMdlSQxvUfD4wNqYaM_2Y8u_DEO9FKoUbtaHNrlHgJjftLW9nfCZu4DafFrNZzsq6vtnerrNVnfNiPWU5b_Isq6qa8bLI1k225nVV3chblrEyq_IiL4pZUU8aXpdlvipnK1bwuuWkyKATUk3CYMbYzY10boDbvCgYm94osQLlwliTMQ0vNKwSxki5vLG3uOndatg4UmRKOu8Oarz0KsxDj-cipFyS8v1rXZFySb-lGU6_s6KTYfiIYdknsb0ZrDqfm0q_HVaTxnSE3aOt9HjXW_MnNJ6w-4DQEXafXHi-Zf8OAAD___yJR8U">