<table border="1" cellspacing="0" cellpadding="8">
    <tr>
        <th>Issue</th>
        <td>
            <a href=https://github.com/llvm/llvm-project/issues/144227>144227</a>
        </td>
    </tr>

    <tr>
        <th>Summary</th>
        <td>
            [SelectionDAG][x86] Shuffle pyramid is not eliminated
        </td>
    </tr>

    <tr>
      <th>Labels</th>
      <td>
            new issue
      </td>
    </tr>

    <tr>
      <th>Assignees</th>
      <td>
      </td>
    </tr>

    <tr>
      <th>Reporter</th>
      <td>
          sutajo
      </td>
    </tr>
</table>

<pre>
    https://godbolt.org/z/zEonMr7da

The above example illustrates the problem.

There are 2 minimum reductions in this snippet, but only the second is compiled into the efficient form:
```
vextracti128 xmm1, ymm0, 1
vpminuw xmm0, xmm0, xmm1
vphminposuw xmm0, xmm0
```

The first one is computed with shuffles and mins:
```
vextracti128    xmm2, ymm0, 1
vpminuw xmm2, xmm0, xmm2
vpshufd xmm3, xmm2, 238                 # xmm3 = xmm2[2,3,2,3]
vpminuw xmm3, xmm2, xmm3
vpshufd xmm4, xmm3, 85                  # xmm4 = xmm3[1,1,1,1]
vpminuw xmm3, xmm3, xmm4
vpsrld  xmm4, xmm3, 16
vphminposuw     xmm2, xmm2
```

Normally, `llvm.vector.reduce.umin.v16i16` is converted into a series of vector_shuffle and umin operations in the SelectionDAG.

```
Initial selection DAG: %bb.0 'test_reduce_v16i16_with_sharing:start'
SelectionDAG has 47 nodes:
  t0: ch,glue = EntryToken
    t2: i64,ch = CopyFromReg t0, Register:i64 %0
  t4: v16i16,ch = load<(load (s256))> t0, t2, undef:i64
    t9: v16i16 = vector_shuffle<8,9,10,11,12,13,14,15,u,u,u,u,u,u,u,u> t4, poison:v16i16
 t10: v16i16 = umin t4, t9
    t11: v16i16 = vector_shuffle<4,5,6,7,u,u,u,u,u,u,u,u,u,u,u,u> t10, poison:v16i16
  t12: v16i16 = umin t10, t11
    t13: v16i16 = vector_shuffle<2,3,u,u,u,u,u,u,u,u,u,u,u,u,u,u> t12, poison:v16i16
  t14: v16i16 = umin t12, t13
  t17: i32 = Constant<0>
      t15: v16i16 = vector_shuffle<1,u,u,u,u,u,u,u,u,u,u,u,u,u,u,u> t14, poison:v16i16
    t16: v16i16 = umin t14, t15
  t19: i16 = extract_vector_elt t16, Constant:i64<0>
  t21: v1i16 = insert_vector_elt poison:v1i16, t19, Constant:i64<0>
      t22: v16i16 = concat_vectors t21, t21, t21, t21, t21, t21, t21, t21, t21, t21, t21, t21, t21, t21, t21, t21
    t24: v16i1 = setcc t4, t22, seteq:ch
      t6: i64,ch = CopyFromReg t0, Register:i64 %1
 t7: v16i16,ch = load<(load (s256))> t0, t6, undef:i64
    t26: v16i16 = BUILD_VECTOR Constant:i16<-1>, Constant:i16<-1>, Constant:i16<-1>, Constant:i16<-1>, Constant:i16<-1>, Constant:i16<-1>, Constant:i16<-1>, Constant:i16<-1>, Constant:i16<-1>, Constant:i16<-1>, Constant:i16<-1>, Constant:i16<-1>, Constant:i16<-1>, Constant:i16<-1>, Constant:i16<-1>, Constant:i16<-1>
  t27: v16i16 = vselect t24, t7, t26
    t28: v16i16 = vector_shuffle<8,9,10,11,12,13,14,15,u,u,u,u,u,u,u,u> t27, poison:v16i16
 t29: v16i16 = umin t27, t28
    t30: v16i16 = vector_shuffle<4,5,6,7,u,u,u,u,u,u,u,u,u,u,u,u> t29, poison:v16i16
  t31: v16i16 = umin t29, t30
    t32: v16i16 = vector_shuffle<2,3,u,u,u,u,u,u,u,u,u,u,u,u,u,u> t31, poison:v16i16
  t33: v16i16 = umin t31, t32
  t38: i16,i16 = merge_values undef:i16, undef:i16
 t39: i16,i16 = merge_values t19, undef:i16
        t34: v16i16 = vector_shuffle<1,u,u,u,u,u,u,u,u,u,u,u,u,u,u,u> t33, poison:v16i16
 t35: v16i16 = umin t33, t34
    t36: i16 = extract_vector_elt t35, Constant:i64<0>
  t40: i16,i16 = merge_values t39, t36
  t43: ch,glue = CopyToReg t0, Register:i16 $ax, t40
  t45: ch,glue = CopyToReg t43, Register:i16 $dx, t40:1, t43:1
  t46: ch = X86ISD::RET_GLUE t45, TargetConstant:i32<0>, Register:i16 $ax, Register:i16 $dx, t45:1
```

In the x86 backend, the `combineExtractVectorElt` function is supposed detect binary operation reductions and replace these with a more efficient implementation. The problem is that `combineExtractVectorElt` only replaces the result of `extract_vector_elt` (in this case t19 and 36), but it does not replace t16 and t35.

During the combination steps, t22 is transformed to `vector_shuffle<0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0> t16, undef:v16i16`, so it depends on t16, and thus it keeps the reduction chain alive.

A possbile solution to this would be:
Let's say `N = extract_vector_elt(V,0)` and `V` comes from a reduction.
If `N` gets replaced with `M`, we could replace `V` with `scalar_to_vector(M)`.
This would eliminate all references to the inefficient reduction.
This is safe to do, since:
- All elements of V expect from the zeroth are known to be undefined
- M does not depend on V, so there will be no cycles.
</pre>
<img width="1" height="1" alt="" src="http://email.email.llvm.org/o/eJzsWE9v2zoS_zTMhaghkZIsH3Jw7aQo0HaBNi_YW0BLY5uvFKklqSR-n34xJOV_jWtsF3172SCGEnM485uZH2fEEc7JjQa4JeV7Ui5vxOC3xt66wYs_zc3KtLvbrfe9I3xO2D1h9xvTrozyE2M3hN3_hZ87oz_baStINifZ_GELVKzMM1B4FV2vgEqlBuet8OCo3wLtrVkp6CZ7eQtUWKCMdlLLbuiohXZovDTaUamp30pHnZZ9D56wBV0NnhqtdkGZg8bolkpHG9P1UkFLpfYmrMF6LRsJ2tO1sR26kM1JlaXfbP4Mr96Kxsuc1fS163LUvuu6DJ85CvSd1MMLroXvjp5xedtJ3Rt3LnJqJgVlLa1D4DCCHTy09EX6LXXbYb1W4KjQLQbBXcFKKdphP4HLzuCysIx2WvyX779mC8p40HjyQxgPcpTwZRQs36MwbozPcnlq8URl-P_EYrH_mi1oXf5gcLRYjBY5Kd9jQg6fSxbHZxEtWtXSHyzm1Vm-jmOY4nOWsy_GdkKpHUqQKlPquZs8Q-ONnQR-wmTopJ4855XMK1JlMa36GawfSSioAyvBUbOmcetTSnXINO6npgcrjrgO9BsoCOxfzj-kQ3IM7aOWXgpF3ShGl_MPhM8pYeVqNckoYVMPzj9FlE8R4BMS7clthZV6Q_jceWE9YVOSzY_t0a1wtJhSbVpILKTUZ6i-2RK22KgBQobutLe7B_MddBCh1DMUkhWGvdkGmYXpd_fWdF9hgzrYgn6FjXQeLOFzWRWIOIsWCtycQrnfr4xoCV8QVuNflLDasbIibIa__C7p9CGHg25hHdWOgGYHnUHdaQYIX9SELWZILdSTB5KhrhwZk6MfeUnYYvjpB2EEpvVGOqMJnycvsjn1eXYGIWQ8yvvZiDPPrwHFDQgFYzO9CuktiMHFtzBSn7M3QcYdCG6Eya_BHAvEfwbvBCa7DLN4GyaLMHmSmgYWcpYIqJ0X2hO-yAi_S56gWHnNl_yX_dj7cokWAUD1tjeRGnmZvAkUHkVSB3hKWEH5oIctjvwM_D_y1rNErlGJ1A7siY4jhPH0Bbs_1RrP-zlvGqMbMap2wXQ4nn_HY6xBB5IESA5804wHjgWqOPDwL8LnzfbgSvVLpStY9dP_onJVFyoXO2fH-z8-flo-Pd4tHv7x9SQvyKPFuxwTc5ax_6_8T1bioZuel5fYrANBMfHTSMh9OWD139Wr2PRSs2Ln_TKWJJaw1iNWft7UfkO7YrOLfYCfd8sEM-xAcCPM8_L0G9oVzy_DPO-WEWbcgeCiVJ0KPGGLUbADu4GnZ6EGcIfikJ_WipQzPruyP5Xys33px_Pznvob2iDnlwjHz5twChGPIdpXQ15d64K8vNIFi-xanHjiT0pfwX985cWe8GDe7giokxXiNSgpxpfa8qdKCv62lnavhc8jXQKaPCmtotKg7Z919fHbEt_T-fzr3cPTh09_3AW7bEEfhN2APw4KHokYlJ-gv4ynTCDO7kkf463lta7oSjTfQbdBfAt4a2pMt5Ia7mLSHkPO7pTH69J60PH6grf7oe-Ng5a24LFOrqQWdne4Gx1PBPDmZKFXogG04iDeoQXtjD2-88uuV9CB9kHDhD4cRg9o0m-Fv4IwDBmSqTi6sOAG5fE6R6rsRybiJsLqcWbRCAd4AgNkHtt_HF9IT1sDjmrjD77kVRD0vEy3vuWAl7VgOKKMsXAeepfeaIInVmi3NraDlnqDyH44xsjXX_-Et9mT-rO_9IY3KhP8gR5066jRo3BwZjs4XP0O0I8hTJmkzVZITYWSz5AcntPeOLeSCqgzaghSYZYjHX0xg2rpCuKl9BPg5dVRJ3bo8JcLpYGw-jG6McPcICJSZY_4d2M6cHRtTUfFARQC-Riy-wWFNuDdmKA0qyFV9jk5_oJ5QVRjCkfdo6BrhBL2yZsEibD6c4QyCUOhvVugZIfpBSqUohbWYEEH0sVJltQHXp9gDTrwAIk1oHBrQkakblKg3tG5UhTiSQiDiEcKrz0esuA7av8LrMETZIF-1-YlxHwFMdlSQxvUfD4wNqYaM_2Y8u_DEO9FKoUbtaHNrlHgJjftLW9nfCZu4DafFrNZzsq6vtnerrNVnfNiPWU5b_Isq6qa8bLI1k225nVV3chblrEyq_IiL4pZUU8aXpdlvipnK1bwuuWkyKATUk3CYMbYzY10boDbvCgYm94osQLlwliTMQ0vNKwSxki5vLG3uOndatg4UmRKOu8Oarz0KsxDj-cipFyS8v1rXZFySb-lGU6_s6KTYfiIYdknsb0ZrDqfm0q_HVaTxnSE3aOt9HjXW_MnNJ6w-4DQEXafXHi-Zf8OAAD___yJR8U">