<table border="1" cellspacing="0" cellpadding="8">
    <tr>
        <th>Issue</th>
        <td>
            <a href=https://github.com/llvm/llvm-project/issues/56520>56520</a>
        </td>
    </tr>

    <tr>
        <th>Summary</th>
        <td>
            [X86] Compiling specific AVX2 intrinsics leads to incorrect removal of a shuffle/broadcast in backend
        </td>
    </tr>

    <tr>
      <th>Labels</th>
      <td>
            new issue
      </td>
    </tr>

    <tr>
      <th>Assignees</th>
      <td>
      </td>
    </tr>

    <tr>
      <th>Reporter</th>
      <td>
          Benjins
      </td>
    </tr>
</table>

<pre>
    This is a presumed miscompilation in the X86 backend that started occurring recently.

If we compile this IR:

```llvm
; Function Attrs: mustprogress nofree nosync nounwind readonly willreturn uwtable
define <2 x i64> @do_stuff(<16 x i8> %0) #0 {
  %2 = icmp eq <16 x i8> zeroinitializer, %0
  %3 = extractelement <16 x i1> %2, i64 0
  %4 = sext i1 %3 to i32
  %5 = insertelement <2 x i32> zeroinitializer, i32 %4, i64 0
  %6 = zext <2 x i32> %5 to <2 x i64>
  %7 = shufflevector <2 x i64> %6, <2 x i64> zeroinitializer, <2 x i32> zeroinitializer
  ret <2 x i64> %7
}

attributes #0 = { mustprogress nofree nosync nounwind readonly willreturn uwtable "frame-pointer"="none" "min-legal-vector-width"="128" "no-trapping-math"="true" "stack-protector-buffer-size"="8" "target-cpu"="x86-64" "target-features"="+avx,+avx2,+crc32,+cx8,+fxsr,+mmx,+popcnt,+sse,+sse2,+sse3,+sse4.1,+sse4.2,+ssse3,+x87,+xsave" "tune-cpu"="generic" }
```

Before commit 72a23cef7e661997118a1c4876c814cfd05d72e4, this would compile to the following assembly:

```asm
      vpxor   xmm1, xmm1, xmm1
      vpcmpeqb        xmm0, xmm0, xmm1
      vmovd   eax, xmm0
      movsx   eax, al
      vmovd   xmm0, eax
      vpbroadcastq    xmm0, xmm0  ; <--- N.B: This line is missing in the trunk case
      ret
```

However, after that commit it produces the following:

```asm
      vpxor   xmm1, xmm1, xmm1
      vpcmpeqb        xmm0, xmm0, xmm1
      vmovd   eax, xmm0
      movsx   eax, al
      vmovd   xmm0, eax
      ret
```

Godbolt link (comparing trunk and 14.0.0): https://godbolt.org/z/Wdase3dM6

The lack of a broadcast means that the upper 64 bits of xmm0 are zero (in this case) instead of mirroring the lower 64 bits (each 64-bit lane should be 0x00000000FFFFFFFF if 0 is passed to the function). The removal relies on undef values, but as far as I can tell there is no UB, because no undef value should be 'seen' by the computation.
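
To make the expected result concrete, here is a minimal scalar sketch in plain C++ of what the IR above computes for each 64-bit lane. The names `expectedResult` and `resultWithoutBroadcast` are hypothetical and only model the semantics; they are not part of LLVM or of the original repro.

```cpp
#include <array>
#include <cstdint>

// Scalar model of the IR: compare byte 0 against zero, sign-extend the i1 to
// i32, zero-extend that to i64, then splat lane 0 into both lanes.
std::array<std::uint64_t, 2> expectedResult(const std::array<std::uint8_t, 16>& in) {
    const std::int32_t sext = (in[0] == 0) ? -1 : 0;              // sext i1 to i32
    const std::uint64_t lane = static_cast<std::uint32_t>(sext);  // zext i32 to i64
    return {lane, lane};                                          // shufflevector <0,0>
}

// Model of the trunk output: vmovd zeroes the upper bits of xmm0 and nothing
// broadcasts lane 0 afterwards, so lane 1 stays zero.
std::array<std::uint64_t, 2> resultWithoutBroadcast(const std::array<std::uint8_t, 16>& in) {
    return {expectedResult(in)[0], 0};
}
```

For an all-zero input, `expectedResult` yields 0x00000000FFFFFFFF in both lanes, while `resultWithoutBroadcast` leaves lane 1 at zero, which matches the difference between the two assembly listings.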

A walkthrough of what's happening (apologies that this is somewhat involved; I'm not sure which part to focus on):

First, the IR is lowered to this DAG initially:

```console
SelectionDAG has 23 nodes:
  t0: ch = EntryToken
  t16: v2i64 = BUILD_VECTOR Constant:i64<0>, Constant:i64<0>
          t13: v2i32 = BUILD_VECTOR Constant:i32<0>, Constant:i32<0>
                t6: v16i8 = BUILD_VECTOR Constant:i8<0>, Constant:i8<0>, Constant:i8<0>, Constant:i8<0>, Constant:i8<0>, Constant:i8<0>, Constant:i8<0>, Constant:i8<0>, Constant:i8<0>, Constant:i8<0>, Constant:i8<0>, Constant:i8<0>, Constant:i8<0>, Constant:i8<0>, Constant:i8<0>, Constant:i8<0>
                  t2: i64,ch = CopyFromReg t0, Register:i64 %0
                t4: v16i8,ch = load<(load (s128))> t0, t2, undef:i64
              t8: v16i1 = setcc t6, t4, seteq:ch
            t10: i1 = extract_vector_elt t8, Constant:i64<0>
          t11: i32 = sign_extend t10
        t14: v2i32 = insert_vector_elt t13, t11, Constant:i64<0>
      t15: v2i64 = zero_extend t14
    t18: v2i64 = vector_shuffle<0,0> t15, undef:v2i64
  t21: ch,glue = CopyToReg t0, Register:v2i64 $xmm0, t18
  t22: ch = X86ISD::RET_FLAG t21, TargetConstant:i32<0>, Register:v2i64 $xmm0, t21:1
```

Part of the BeforeLegalizeTypes combine sees that we don't use the second element of t14 here:

```console
t14: v2i32 = insert_vector_elt t13, t11, Constant:i64<0>   ; <--- N.B. t13 is a zero literal
```

so it can be transformed into:

```console
t25: v2i32 = BUILD_VECTOR t11, undef:i32
```
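
As a rough illustration of this step, the sketch below models a vector as a list of lanes in which `std::nullopt` stands for undef, and replaces every lane that no user reads with undef. The helper `dropUndemandedLanes` and the integer value ids are assumptions made for this example; this is not how the DAG combiner is implemented.

```cpp
#include <cstddef>
#include <optional>
#include <vector>

// Toy model only: std::nullopt stands for undef, an int is an opaque value id.
using Lane = std::optional<int>;

// Hypothetical helper: any lane that is not demanded by a user may become undef.
std::vector<Lane> dropUndemandedLanes(std::vector<Lane> lanes,
                                      const std::vector<bool>& demanded) {
    for (std::size_t i = 0; i < lanes.size(); ++i) {
        if (!demanded[i]) {
            lanes[i] = std::nullopt;  // the zero from t13 is no longer preserved
        }
    }
    return lanes;
}
```

With lanes `{11, 0}` (id 11 standing in for t11, 0 for the zero from t13) and only lane 0 demanded, the result is `{11, undef}`, which corresponds to the BUILD_VECTOR shown above.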

Later, as part of the AfterLegalizeVectorOps combine, we are looking at this part of the DAG:

```console
        t38: v4i32 = BUILD_VECTOR t35, Constant:i32<0>, undef:i32, Constant:i32<0>
      t39: v2i64 = bitcast t38
    t18: v2i64 = vector_shuffle<0,0> t39, undef:v2i64
```

combineShuffleOfSplatVal wants to know whether t39 is a splat vector. It decides that it is: viewed as v2i64, element 0 is (t35, 0) and element 1 is (undef, 0), so t35 ~ undef and 0 ~ 0. Because of this, it removes the shuffle, since all the elements look interchangeable.

Previously, we would reject this optimization because isSplatValue was stricter: it did not allow any undef in t38, rather than tolerating partially undef elements. This comes from this specific change in SelectionDAG.cpp:

```cpp
UndefElts |= APIntOps::ScaleBitMask(SubUndefElts, NumElts, /*MatchAllBits=*/true); // N.B. MatchAllBits was effectively false before
```
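
The effect of that flag can be sketched with a small standalone model. This is a simplified illustration under my own assumptions, not LLVM's implementation: narrow elements are opaque value ids with `std::nullopt` standing for undef, and the names `splatThroughBitcast` and `shuffleLooksRedundant` are hypothetical. A wide element is reported as undef either when all of its narrow pieces are undef (`matchAllBits == true`) or when any of them is (`matchAllBits == false`).

```cpp
#include <cstddef>
#include <optional>
#include <vector>

// Toy model only: std::nullopt stands for undef, an int is an opaque value id.
using Elt = std::optional<int>;

struct SplatInfo {
    bool isSplat;                 // every wide element can be made identical
    std::vector<bool> wideUndef;  // per wide element: is it reported as undef?
};

// Views `narrow` as wide elements made of `scale` narrow pieces each and checks,
// for every piece position, that all defined pieces hold the same value.
SplatInfo splatThroughBitcast(const std::vector<Elt>& narrow, std::size_t scale,
                              bool matchAllBits) {
    const std::size_t numWide = narrow.size() / scale;
    // Match-all starts from "undef" and clears the flag on any defined piece;
    // match-any starts from "defined" and sets the flag on any undef piece.
    SplatInfo info{true, std::vector<bool>(numWide, matchAllBits)};
    for (std::size_t sub = 0; sub < scale; ++sub) {
        Elt seen;  // first defined value found at this piece position
        for (std::size_t w = 0; w < numWide; ++w) {
            const Elt& e = narrow[w * scale + sub];
            if (!e) {
                if (!matchAllBits) info.wideUndef[w] = true;
                continue;  // undef is compatible with any value
            }
            if (matchAllBits) info.wideUndef[w] = false;
            if (seen && *seen != *e) info.isSplat = false;
            seen = e;
        }
    }
    return info;
}

// Models a caller that does not allow undefs: the shuffle is dropped only if
// the operand is a splat and no wide element is reported as undef.
bool shuffleLooksRedundant(const SplatInfo& info) {
    if (!info.isSplat) return false;
    for (bool undef : info.wideUndef) {
        if (undef) return false;
    }
    return true;
}
```

Feeding it the t38 pattern `{t35, 0, undef, 0}` with `scale = 2`, the match-all variant reports no undef wide elements and the shuffle looks redundant, whereas the match-any variant flags wide element 1 as undef and the shuffle is kept.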

So to summarise:

 1. An insert is changed to use an undef as the input vector instead of a zero'd one, since its second element is unused
 2. The bitcast vector is considered a splat, since the undef is placed such that it appears that all the elements of it are equivalent
 3. Partially undef elements are now allowed due to 72a23cef7e661997118a1c4876c814cfd05d72e4, and it is still considered a splat for optimization purposes

I'm not sure where the underlying issue is: all these steps seem reasonable to me at a cursory look, but they end up removing the broadcast, which changes the semantics of the code.

Original C++ repro (minimised):

```cpp
// Compiled with -mavx2 -O1
#include <immintrin.h>

__m128i do_stuff(__m128i I0, __m128i I1) {
        __m128i A = _mm_cmpeq_epi8(I1, I0);
        __m128i B = _mm_cvtepi8_epi32(A);

        int Scalar = _mm_cvtsi128_si32(B);

        __m128i NewVector = _mm_cvtsi32_si128(Scalar);
        __m128i ZextVector = _mm_cvtepu32_epi64(NewVector);
        __m128i DoubledZext = _mm_broadcastq_epi64(ZextVector);

        return DoubledZext;
}
```

Godbolt link (comparing trunk w/ 14.0.0): https://godbolt.org/z/M3exGheG8

I tested this on latest trunk (04419a5f55d7eebc2d2f3c9fb74d3a417b7964c1) and confirmed it's still present.

For context: this was found by a fuzzer that tests SIMD codegen; it did not come from manually written code.
</pre>
iOb8_uGG8DFafr19XN19BHTSciB6tLX8m7j07npW4OCdUuILpQifc10R_ZH6GfQcj_samQc5LKEyD6nNp6GdYBmlkplhlBCJUAvgf8a6FpXYBTGjHPpL-eI_4xnsZZ06Iip3NGIrkUJCS10VdlYZWlGtSVVAQmUsyhwohE5U0ACqX9tKOBm8nVu8-H14d83-WWE-cuMLYu0SubfSkurjzkjfrYo-172hiAAWovKrUOrJlla-dDhmgsz8S_vpgzdyARSf31c0eTtDvtjwr2RSEy1OwxWlpS09SYx_KaijxRtBfVb1XpcPjtHn_KEuuPnOC5RtFWpcVEZPldpRAQrGzsM0TfESjNi9VXn1pMmfpD4uIKEr9LK3XhYyLyJnbB-NRwhBN8uayRGCQyPQLfjmx-8OWFulggbJzNSLPcHqrnTtAlFbN7CHF-mGV2tB5xonheiXRmylajUKO-c3rhFuxJ_YiPMaVRtZyp_uKLHbhNSdStqOBhFTUguuDRp2qk1tMcr7Frra-8KZuoFo_qIJ6OeRk9pK08_utnKo5XMkZSearkUqc5kytzvifCg9-XqU1vWbTo4h--QbrXJbQFeD2TU50fLLfWU-u24pWj6kvBBX0nzi-gkp-6FNegLawW9t2d3a1mr5iZt0sywKkGh7hEEtlzvjWVh0sg2YQ6fjufAszUSek_Bbgd3nvICeE4vI73jqgyJv1G1Zov_TL-GWBSO2rDycku6comxtT2YE0Dktc-dbskK_4p34xDoWQAH5-OUgpvM-_RL7sUhbgXfmJQjJcKIP4I434VWlZWY7DR89B762rXTOAthCD4pJukWqtAkIPk9dEm98Qnrl9ZCY5sAZxY9WolHDUy9ONGJf3vAwS0BhbX0RK_qm8a8cJlEs24hHGEiI9XqT6MWa05iq26ZWGs3R8Zn7y26O2tFOK02xt6ctWrcUiYSBXgWawk_UZBRR0uGmVpU9y8Q2SkGJgLO0bbRq9hY2uh4YpHtGRVNbO6yxRwlktq7pt_Bge0rnQh6LRAlAlKnu0kqKLu8EYD43ci0rICcd9eEL9rU7FCjRU5aSHOVVU_o6Sn3YXLsDuoztpNmwYUknmGz42dc3q1WJilqyo9cA3aN7Ww31vwL7MqB7DzAYL7qRpc0jq7Jc2ZOglaiphJ_f27R9705TXlNdHai2hkiIjlLdfHlK4emAyIyAhTfHhFqC2Uo7wquzhN2Cv4nd9-4M_og-CleWCQGVZf-GvH-gRH5NL-oWDCA5dUHzfok3eNyoFo6V_eFeJTgmhyPEns1hqbMb8kfuR8wOc9497_1nJ2A7cpe_dAT2KRLPHzbiw_wkDpkRmt6RuVRYMUSwoDrELoJ1x3EcLPgknwADhEjSMAvzKF3kySzOIh4Hs2S2mMapdTgCBwBCLl1J6Q58HE7Qqztg0Eno3ME8mG6sUpb-nJpOqxRAgI6YOMvbnz_prFRZMdnD_acbG4JrOoaylQMREIwgNSJUW4K94a6RBi2SC9aL7DLKFtGCXxhpCnE5mFyhFxlMbnywkUr7PLv8_ntI9QQUrSnqC-QHWw4BtVXTUM1g4QPxbrNGV4mFd4fTQ0ji30JetE1x-cIsiOs2GcGa-GHfJboLvRWhkgQ_Lewh495NppNwfLG5hO7jaZImcZ6Ei3Q8TcPpZDGOJ_EiWSySmbgoeCIKTTujFzFi55CT3jFMbi7kZTgOw_EsiMfBeBrFI6TcScznMQ-CeD7NZ4N4DJSTxYjkIH-5aC6tSEm71hgs0HzpwyDX1AQLq0jiz1uzUc3llaj-hNIu7NKXVvR_ANh05Ck">