<table border="1" cellspacing="0" cellpadding="8">

    <tr>

        <th>Issue</th>

        <td>

            <a href=https://github.com/llvm/llvm-project/issues/69929>69929</a>

        </td>

    </tr>

    <tr>

        <th>Summary</th>

        <td>

            High dependency chain of shufflevector instructions

        </td>

    </tr>

    <tr>

      <th>Labels</th>

      <td>

            new issue

      </td>

    </tr>

    <tr>

      <th>Assignees</th>

      <td>

      </td>

    </tr>

    <tr>

      <th>Reporter</th>

      <td>

          Rin18

      </td>

    </tr>

</table>

<pre>

    Compiling [this function](https://godbolt.org/z/4fWWd8jh9) with clang leads to a high chain of recurrences that causes slow performance. The chain of recurrence is equal to the highest integer n for src[i - n]. Looking at the assembly generated, there is a long list of ext instructions that all depend on one another, causing a high dependency chain:

```

ext     v3.16b, v20.16b, v4.16b, #12 // 1st extract, v3.16b is the destination vector with v20.16b and v4.16b as source vectors

ext     v25.16b, v21.16b, v3.16b, #12 // 2nd extract depends on the 1st, v25.16b is the destination vector with v21.16b and v3.16b(destination vector of the 1st extract) as source vectors

ext v26.16b, v0.16b, v25.16b, #12 // 3rd extract depends on the 2nd, v26.16b is the destination vector with v0.16b and v25.16b(destination vector of the 2nd extract) as source vectors

ext     v27.16b, v22.16b, v26.16b, #12 // 4th extract depends on the 3rd, v27.16b is the destination vector with v22.16b and v26.16b(destination vector of the 3rd extract) as source vectors

ext v28.16b, v23.16b, v27.16b, #12 // 5th extract depends on the 4th, v28.16b is the destination vector with v23.16b and v27.16b(destination vector of the 4th extract) as source vectors

ext     v0.16b, v24.16b, v28.16b, #12 // 6th extract depends on the 5th, v0.16b is the destination vector with v24.16b and v28.16b(destination vector of the 5th extract) as source vectors

```

To understand the performance issue let's start by looking at the case when n=2.  This is a slightly simpler example of the same function: [IR for n=2](https://godbolt.org/z/orsdTvE3a) . Something to look at is the shufflevector instructions used. Specifically the shuffles of shuffles and number of PHIs:

```

vector.body:

  %23 = phi i64 [ 0, %vector.ph ], [ %36, %vector.body: ] ; index = 0 if coming from ph or next.index if coming back to loop

  %24 = phi <4 x i32> [ %16, %vector.ph ], [ %27, %vector.body ] ; recurrence1 =  [ / / / Length-2] -> recurrence1 if coming from ph or src[i] if coming back to loop

  %25 = phi <4 x i32> [ %17, %vector.ph ], [ %28, %vector.body ] ; recurrence2 =  [ / / / Length-1] -> recurrence1 if coming from ph or src[i] if coming back to loop

  %26 = getelementptr inbounds i32, ptr %src, i64 %23 ; src[i] 

  %27 = load <4 x i32>, ptr %26, align 4 ; load src[i] 

  %28 = shufflevector <4 x i32> %24, <4 x i32> %27, <4 x i32> <i32 3, i32 4, i32 5, i32 6> ; shufflevec1 = recurrence1, load of src[i], mask <i32 3, i32 4, i32 5, i32 6> 

  %29 = shufflevector <4 x i32> %25, <4 x i32> %28, <4 x i32> <i32 3, i32 4, i32 5, i32 6> ; shufflevec2 = recurrence2, shufflevec1, mask <i32 3, i32 4, i32 5, i32 6>

```

We should be able to lower the recurrence chain by using less variables to keep track of it. This would mean we reduce both the number of PHIs (which each consumes a vector register in the loop) and the shuffles of shuffles. The target IR for n=2 should look something like this:

```

%vector.ph:

  %21 = getelementptr inbounds i32, ptr %src, i64 -2 ; Length-2

  %22 = load i32, ptr %21, align 4

  %23 = getelementptr inbounds i32, ptr %src, i64 -1 ; Length-1

  %24 = load i32, ptr %23, align 4

  %Ri = insertelement <4 x i32> poison, i32 %24, i64 3 ; [ / / / Length-1] -> recurrence1

  %Ri = insertelement <4 x i32> poison, i32 %22, i64 2 ; [ / / Length-2 Length-1] -> recurrencei

%vector.body: ; preds = %vector.body, %vector.ph

  %index = phi i64 [ 0, %vector.ph ], [ %index.next, %vector.body ]

  %P1 = phi <4 x i32> [ %Ri, %vector.ph ], [ %wide.load, %vector.body ]

  %l6 = getelementptr inbounds i32, ptr %src, i64 %index

  %wide.load = load <4 x i32>, ptr %l6, align 4 ; load src[I]

  %shuf_vec1 = shufflevector <4 x i32> %P1, <4 x i32> %wide.load, <4 x i32> <i32 3, i32 4, i32 5, i32 6>

  %shuf_vec2 = shufflevector <4 x i32> %P1, <4 x i32> %wide.load, <4 x i32> <i32 2, i32 3, i32 4, i32 5>

 %index.next = add nuw i64 %index, 4

```

Furthermore, if we look at the case where n>4  we can remove n/4 additional shufflevector dependencies:

n=4 :

```

%vector.ph:

  %17 = getelementptr inbounds i32, ptr %src, i64 -4 ; Length-4

  %18 = load i32, ptr %21, align 4

  %19 = getelementptr inbounds i32, ptr %src, i64 -3 ; Length-3

  %20 = load i32, ptr %23, align 4

  %21 = getelementptr inbounds i32, ptr %src, i64 -2 ; Length-2

  %22 = load i32, ptr %21, align 4

  %23 = getelementptr inbounds i32, ptr %src, i64 -1 ; Length-1

  %24 = load i32, ptr %23, align 4

  %Ri = call <4 x i32> @llvm.build_vector(i32 %18, i32 %20, i32 %22, i32 %24) ; [ Length-4 Length-3 Length-2 Length-1]

%vector.body: ; preds = %vector.body, %vector.ph

  %index = phi i64 [ 0, %vector.ph ], [ %index.next, %vector.body ]

  %P1 = phi <4 x i32> [ %Ri, %vector.ph ], [ %wide.load, %vector.body ]

  %l6 = getelementptr inbounds i32, ptr %src, i64 %index

  %wide.load = load <4 x i32>, ptr %l6, align 4 ; load src[I]

  %shuf_vec1 = shufflevector <4 x i32> %P1, <4 x i32> %wide.load, <4 x i32> <i32 3, i32 4, i32 5, i32 6>

 %shuf_vec2 = shufflevector <4 x i32> %P1, <4 x i32> %wide.load, <4 x i32> <i32 2, i32 3, i32 4, i32 5>

  %shuf_vec3 = shufflevector <4 x i32> %P1, <4 x i32> %wide.load, <4 x i32> <i32 1, i32 2, i32 3, i32 4>

  %shuf_vec4 = %P1

  %index.next = add nuw i64 %index, 4

```

n=5:

```

%vector.ph:

  %15 = getelementptr inbounds i32, ptr %src, i64 -5 ; Length-5

  %16 = load i32, ptr %15, align 4

  %17 = getelementptr inbounds i32, ptr %src, i64 -4 ; Length-4

  %18 = load i32, ptr %17, align 4

  %19 = getelementptr inbounds i32, ptr %src, i64 -3 ; Length-3

  %20 = load i32, ptr %19, align 4

  %21 = getelementptr inbounds i32, ptr %src, i64 -2 ; Length-2

  %22 = load i32, ptr %21, align 4

  %23 = getelementptr inbounds i32, ptr %src, i64 -1 ; Length-1

  %24 = load i32, ptr %23, align 4

  %R2i = insertelement <4 x i32> poison, i32 %16, i64 3 ; [ / / / Length-5]

  %Ri = call <4 x i32> @llvm.build_vector(i32 %18, i32 %20, i32 %22, i32 %24) ; [ Length-4 Length-3 Length-2 Length-1]

%vector.body: ; preds = %vector.body, %vector.ph

 %index = phi i64 [ 0, %vector.ph ], [ %index.next, %vector.body ]

  %P1 = phi <4 x i32> [ %Ri, %vector.ph ], [ %wide.load, %vector.body ]

  %P2 = phi <4 x i32> [ %R2i, %vector.ph ], [ %P1, %vector.body ]

  %l6 = getelementptr inbounds i32, ptr %src, i64 %index

  %wide.load = load <4 x i32>, ptr %l6, align 4 ; load src[I]

  %shuf_vec1 = shufflevector <4 x i32> %P1, <4 x i32> %wide.load, <4 x i32> <i32 3, i32 4, i32 5, i32 6>

 %shuf_vec2 = shufflevector <4 x i32> %P1, <4 x i32> %wide.load, <4 x i32> <i32 2, i32 3, i32 4, i32 5>

  %shuf_vec3 = shufflevector <4 x i32> %P1, <4 x i32> %wide.load, <4 x i32> <i32 1, i32 2, i32 3, i32 4>

  %shuf_vec4 = %P1

  %shuf_vec5 = shufflevector <4 x i32> %P2, <4 x i32> %P1, <4 x i32> <i32 3, i32 4, i32 5, i32 6>

  %index.next = add nuw i64 %index, 4

```

</pre>

<img width="1px" height="1px" alt="" src="http://email.email.llvm.org/o/eJzsWl1v27gS_TXyyyCGRX3YevBDk9TYAvtQdAv0cUFJY4sbWvQlqTi5v_5iqA_TjhI73u3dxSJF0yoSOXPOkDyHlM2NEZsacRkkt0FyP-GNrZRefhN1uJjkqnxe3qntTkhRbyBIbm0lDKyburBC1UFyH7BFZe3OBNGngK0CttqoMlfSTpXeBGz134Ct4vWPH-XijyoLWAZ7YSsoJK83IJGXBqwCDpXYVFBUXNSg1qCxaLTGukADtuIWCt4YNGCk2sMO9VrpLa8LnML3Cse6gTCA_2m4pOi2QhcfjQVRW9yghhrWSoPRRZDcCrgBYjKFX5V6IJrcuk7cGNzm8hk2WKPmFsuA3dET7RJwkIpYCGMpOz5ReGN140rTIedSQok7rEtQNagagdeKQlAoouXytfzbdlgXzy0nKunsPph9CtJZ99f9Spnoz2M0DdOcAj2y2XAZ91cBi0IG7aBAaCwh1LywrpXrSiyIaInGipoTbHjEwirdDlMXFnhddnGBGzCq0QV2Dc0JJJYcMIXDZTSKidVlj6njbqhIhCg0Lcw23Hmc4QFnl2sx0lit--CHWmRvUXpk6cDhUOEDxyM6kX6VDqvLtmd6ER2v6n2uN-h4ZTxDpx2h-YEIO1ymo5xiW73GKdIdp_llQ8Q8UulZUl4xz47R4sDCWxHzUULJ64RiW7U9F5cRijxC87OEvEpeMkrebIsPl4tRTunrnJKO0-wySrFHaXGWUnIRpRPxav_9rqCpS9TGUjIK5sk6CGMaBIk2YHMDxnJtIX8GeSzPBTcI-wprqIPonk0BvpM3OWU2UmwqK5_BiO1OogZ84nTRQzd8iwcXiz6Rs3355kzBBbvM2JQ25ffHzxEn-lP4TW3RVoTQKgeWkHY1N1WzXkvsCnjkE43Bcgq_7bAQa1FwKZ_9HoYgD9dUrbrZ5ugG4esvX8xrJtFmmpKBD00AApawCILoHnaVAJHGRBxm7ZRKuj67Chz_O_cwYEmUHjfoglIrCKJbEHWJTy7qDMQaCrWlIqy12sKuAqopPtlp2-rwPOfFQ1epnY8vHvAF0V0MTyAiFkSfezBhegYtm79EO0A97BFCl6brtBp-fsV6Y6sbmgFwQ2n9HqPkul0EdThPLjlDbn6O3OIicuxNcuFPIpe6rBu0KHGLtd1Zmuq5akiRiCi7A7oXsITisrt2BnZz8tZP5oedu7BS8fK4aF445iYFl2JTQ-xiueavBFy4gMdL8mQ4aB66Sr-4PR-5Hd2JiEHkGEUM4v4i6S_Sttmtl7Sdf94AUFuHmhb8AJzubrl5uDiJxzO7iGcyznPxF_FkJzzdLPDK8E6Co2L3g-RSNbKEHIHnEtvZuUftpNQ7GLSHhfwZ2p23RGPgkWtBndxZ5AFxB-RpDzQOwk5bV9m76FvkNewpYNkUCLmylUtwLMkQsMW-EkUFyIsKClWbZkvi3funxo0wFmlxuO5uHZGDdl44Jv3tWcdyvUELvlX1zJ3hmMGCpHhAoMPaawbhi8yJQ4RXreMb5kZ-EFAvIDus4OPuLPTW7UuTejeE0IcQvnSVMQjROIRvwvUQtUHdozhZDzsljKr72TloBiFpBe09-vtnU7M-NXuRuh-St_KL01kxmHx0CzuNpXGYTp6fmJXH4bAneM9Ow_Wa0nZh3Oa8BF_Dt630mziTay9KnNKEOJtKXm1sjo8Xach53tLkW5b25RggicTvg6ecEfyv4ajgH5fjCt0fwcN-Kh7WZx8DNuA5mlUODy9pE70_HiN216_-E51cNdpWqLdKo4u-JgPod_f-OUQjCfLnGKhBwWvQuFWPdJOtYkoqaL_P5Uk9hrc-Ag9aTcpOg_5e6Q7n1-lm7Oumr4Lh4r3SHWbXQYh8CJEv3bP3SveHgXkuQifK0_UTz6R83E7zRsjy93YqBWzROUm48G1l9tJjDl6XDVbTT51hAEdN58NjPjzmr_CYf5jF-Hiin4on7LOPARvFE_dL6mt4unKudUQyp-QKb0quk8TEl8TED5i-Jolh8oo3_f_ssX2T87faY5h92OMb9siuOGW1Lx_PHvCSY-n8dzvxv8yIv7Izqdi5XJ26f7j9h9v_M9y-f5pchJaNoh0l8d73An9q2zEpl1GZRRmf4DJMs8VszlgYTaplgemCZ4gZhmmWZBzTeTgP-TqZhcm6iIuJWLIZi0L6ieOIpdNsztZ5Gc7zLMpnaYhBPMMtF3LqxFjpzcR9BLlMs4xlE8lzlMZ9P4axGvft55MBY0FyP9FL6nOTNxtDYi6MNYcoVliJy1_Gvtzhvdsd-VBw0mi5PPn4UdiqyaeF2gZsRRm6_252Wv2BhQ3YyuEyAVs53P8LAAD__9U0Ny0">