<table border="1" cellspacing="0" cellpadding="8">
    <tr>
        <th>Issue</th>
        <td>
            <a href=https://github.com/llvm/llvm-project/issues/59766>59766</a>
        </td>
    </tr>

    <tr>
        <th>Summary</th>
        <td>
            Very inefficient SIMD for Loop nest optimization in x86-64
        </td>
    </tr>

    <tr>
      <th>Labels</th>
      <td>
      </td>
    </tr>

    <tr>
      <th>Assignees</th>
      <td>
      </td>
    </tr>

    <tr>
      <th>Reporter</th>
      <td>
          kbuyukakyuz
      </td>
    </tr>
</table>

<pre>
    Hello,

Here is a simple example of L1 cache optimization using loop tiling:
```
// Loop tiling
void tiled_loop() {
  for (int i = 0; i < N; i += blockSize) {
    for (int j = 0; j < blockSize; j++) {
      sum += array[i + j];
    }
  }
}
```
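
For anyone who wants to reproduce this without opening the link, a self-contained version could look like the sketch below. The declarations are assumptions: `N` and `blockSize` use the benchmark settings quoted in this report, while the `int` element type, the `long long` accumulator, and the `original_loop` name are illustrative guesses, so the generated assembly may differ from the godbolt output if the real declarations differ.

```cpp
#include <vector>

// Assumed values, matching the benchmark settings quoted in this report.
constexpr int N = 1000000;
constexpr int blockSize = 16;
static_assert(N % blockSize == 0, "tiled loop assumes N is a multiple of blockSize");

// Plain reduction over the whole array (hypothetical name for the untiled baseline).
// The caller must pass at least N elements.
long long original_loop(const std::vector<int>& array) {
  long long sum = 0;
  for (int i = 0; i < N; i++) {
    sum += array[i];
  }
  return sum;
}

// Same reduction, walked in blocks of blockSize elements.
long long tiled_loop(const std::vector<int>& array) {
  long long sum = 0;
  for (int i = 0; i < N; i += blockSize) {
    for (int j = 0; j < blockSize; j++) {
      sum += array[i + j];
    }
  }
  return sum;
}
```

Both functions visit the same elements in the same order, so they should return identical sums; only the loop structure presented to the optimizer differs.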

Here is the generated [assembly code](https://godbolt.org/z/P71M67hxs) with `x86-64 clang 15.0.0 -O3 -march=tigerlake`, alongside the corresponding GCC-generated assembly.

As the assembly shows, GCC generates significantly better code than clang.

Interestingly, with clang the tiled loop is up to 10x slower than the original, untiled loop. Here are my benchmarks for `N = 1000000` and `blockSize = 16`:

```
gcc Original loop: 0.0422719 seconds
gcc Tiled loop: 0.0250076 seconds

clang Original loop: 0.0292406 seconds
clang Tiled loop: 0.173324 seconds
```
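
Timings like these can be collected with a small wall-clock harness. The following is only a sketch of one way to do it, not the harness used for the numbers above: the helper name `time_seconds` and the `reps` parameter are hypothetical.

```cpp
#include <chrono>

// Hypothetical helper: calls fn() `reps` times and returns the total
// elapsed wall-clock time in seconds, measured with a monotonic clock.
template <typename F>
double time_seconds(F&& fn, int reps) {
  const auto start = std::chrono::steady_clock::now();
  for (int r = 0; r < reps; r++) {
    fn();
  }
  const auto stop = std::chrono::steady_clock::now();
  return std::chrono::duration<double>(stop - start).count();
}
```

When timing a reduction this way, the result of each call should be stored somewhere observable (for example, written to a `volatile` variable or printed), otherwise the compiler is free to delete the loop being measured.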

The specific values I get on my local machine are not very important, but clang's assembly shouldn't be this bloated.

</pre>