<table border="1" cellspacing="0" cellpadding="8">
    <tr>
        <th>Issue</th>
        <td>
            <a href="https://github.com/llvm/llvm-project/issues/62364">62364</a>
        </td>
    </tr>

    <tr>
        <th>Summary</th>
        <td>
            [FuncSpec] perf regression: loop not vectorized after function specialization
        </td>
    </tr>

    <tr>
      <th>Labels</th>
      <td>
            regression,
            performance,
            vectorization
      </td>
    </tr>

    <tr>
      <th>Assignees</th>
      <td>
      </td>
    </tr>

    <tr>
      <th>Reporter</th>
      <td>
          vzakhari
      </td>
    </tr>
</table>

<pre>
CPU2000/172.mgrid has slowed down from 28 seconds to 39 seconds on Icelake after https://reviews.llvm.org/D148345.  The benchmark was compiled with LLVM Flang.

There are now two versions of the `resid` function that together account for ~27 seconds, whereas before the change the single version took only 16 seconds.  The issue is that the innermost loop in the specialized version is no longer vectorized:
```
   184        SUBROUTINE RESID(U,V,R,N,A)
   185        INTEGER N
   186        REAL*8 U(N,N,N),V(N,N,N),R(N,N,N),A(0:3)
   187        INTEGER I3, I2, I1
   188  C
   189        DO 600 I3=2,N-1
   190        DO 600 I2=2,N-1
   191        DO 600 I1=2,N-1
   192   600  R(I1,I2,I3)=V(I1,I2,I3)
   193       >      -A(0)*( U(I1,  I2,  I3  ) )
   194       >      -A(1)*( U(I1-1,I2,  I3  ) + U(I1+1,I2,  I3  )
   195       >                 + U(I1,  I2-1,I3  ) + U(I1,  I2+1,I3  )
   196       >                 + U(I1,  I2,  I3-1) + U(I1,  I2,  I3+1) )
   197       >      -A(2)*( U(I1-1,I2-1,I3  ) + U(I1+1,I2-1,I3  )
   198       >                 + U(I1-1,I2+1,I3  ) + U(I1+1,I2+1,I3  )
   199       >                 + U(I1,  I2-1,I3-1) + U(I1,  I2+1,I3-1)
   200       >                 + U(I1,  I2-1,I3+1) + U(I1,  I2+1,I3+1)
   201       >                 + U(I1-1,I2,  I3-1) + U(I1-1,I2,  I3+1)
   202       >                 + U(I1+1,I2,  I3-1) + U(I1+1,I2,  I3+1) )
   203       >      -A(3)*( U(I1-1,I2-1,I3-1) + U(I1+1,I2-1,I3-1)
   204       >                 + U(I1-1,I2+1,I3-1) + U(I1+1,I2+1,I3-1)
   205       >                 + U(I1-1,I2-1,I3+1) + U(I1+1,I2-1,I3+1)
   206       >                 + U(I1-1,I2+1,I3+1) + U(I1+1,I2+1,I3+1) )
```

I believe the function is specialized for calls like this:
```
  call void @resid_(ptr @x_, ptr getelementptr (i8, ptr @x_, i64 20247552), ptr getelementptr (i8, ptr @x_, i64 37823552), ptr %9, ptr getelementptr (i8, ptr @x_, i64 58071104))
```
It eventually becomes a call like this:
```
call fastcc void @resid_.1(ptr nonnull @x_, ptr nonnull %4)
```
In the function body, the references to `%1`, `%2` and `%4` are replaced with the corresponding `getelementptr` constants.
> FWIW, the argument `%0` stays as-is even though all the calls of `@resid_.1` seem to be passing `@x_` as the first argument.  I suppose it could have been substituted in the function body, but this is not the problem I am describing here.

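As a side note on the `getelementptr` constants in the call above, here is a small Python sketch that only does arithmetic on those byte offsets. The 8-byte element size is taken from the `REAL*8` declaration; the names `V_OFFSET`, `R_OFFSET` and `gap_in_elements` are made up for illustration. For this call the `V` and `R` regions turn out to be exactly 130^3 elements apart inside `@x_`, so they would not overlap as long as `N` does not exceed 130, but `N` is a runtime value in the IR, so this is context rather than a diagnosis of what the vectorizer actually rejects.

```python
# Byte offsets into @x_, copied from the call shown above.
V_OFFSET = 20247552   # second argument (V)
R_OFFSET = 37823552   # third argument (R)
ELEM_SIZE = 8         # REAL*8 elements are 8 bytes wide


def gap_in_elements(lo, hi, elem_size=ELEM_SIZE):
    """Distance between two byte offsets, expressed in REAL*8 elements."""
    diff = hi - lo
    # Both offsets should be element-aligned; fail loudly if they are not.
    assert diff % elem_size == 0
    return diff // elem_size


gap_v_r = gap_in_elements(V_OFFSET, R_OFFSET)
print(gap_v_r)             # 2197000
print(gap_v_r == 130**3)   # True
```
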
Unfortunately, with this specialization the `loop-vectorize` pass is not able to vectorize the innermost loop due to some unsafe dependency:
```
--- !Analysis
Pass:            loop-vectorize
Name:            UnsafeDep
DebugLoc:        { File: mgrid.f, Line: 192, Column: 2 }
Function:        resid_.1
Args:
  - String:          'loop not vectorized: '
  - String:          'unsafe dependent memory operations in loop. Use #pragma loop distribute(enable) to allow loop distribution to attempt to isolate the offending operations into a separate loop'
  - String:          "\nUnknown data dependence."
  - String:          ' Memory location is the same as accessed at '
  - Location: 'mgrid.f:192:2'
    DebugLoc:        { File: mgrid.f, Line: 192, Column: 2 }
...
--- !Missed
Pass:            loop-vectorize
Name: MissedDetails
DebugLoc:        { File: mgrid.f, Line: 191, Column: 7 }
Function:        resid_.1
Args:
  - String:          loop not vectorized
...
```

In the original `resid` copy, by contrast, the loop is vectorized (though with runtime pointer conflict checks):
```
--- !Passed
Pass:            loop-vectorize
Name:            Vectorized
DebugLoc:        { File: mgrid.f, Line: 191, Column: 7 }
Function:        resid_
Args:
  - String:          'vectorized loop (vectorization width: '
  - VectorizationFactor: '4'
  - String:          ', interleaved count: '
  - InterleaveCount: '1'
  - String:          ')'
...
```
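
To compare the two copies without reading the whole remarks file, the remarks can be grouped by function. The following is a minimal sketch in Python using only the standard library. It assumes the remark layout shown above (a `--- !Kind` header, top-level `Pass`/`Name`/`Function` fields, and a closing `...` line) and is not a general YAML parser. The function name `summarize_remarks` and the trimmed `SAMPLE` text are made up for illustration.

```python
import re
from collections import defaultdict

HEADER = re.compile(r"^--- !(\w+)")
# Only top-level fields (column 0) are matched, so nested Args entries are ignored.
FIELD = re.compile(r"^(Pass|Name|Function):\s*(\S+)")


def summarize_remarks(text, pass_name="loop-vectorize"):
    """Return {function: [(kind, name), ...]} for remarks emitted by one pass."""
    summary = defaultdict(list)
    kind, fields = None, {}
    for line in text.splitlines():
        header = HEADER.match(line)
        if header:
            # A new remark document starts here.
            kind, fields = header.group(1), {}
        elif line.strip() == "..." and kind is not None:
            # End of the current remark: record it if it belongs to the pass.
            if fields.get("Pass") == pass_name:
                summary[fields.get("Function")].append((kind, fields.get("Name")))
            kind = None
        elif kind is not None:
            field = FIELD.match(line)
            if field:
                fields.setdefault(field.group(1), field.group(2))
    return dict(summary)


# Two remarks from this report, with the Args lists trimmed.
SAMPLE = """\
--- !Missed
Pass:            loop-vectorize
Name:            MissedDetails
DebugLoc:        { File: mgrid.f, Line: 191, Column: 7 }
Function:        resid_.1
Args:
  - String:          loop not vectorized
...
--- !Passed
Pass:            loop-vectorize
Name:            Vectorized
DebugLoc:        { File: mgrid.f, Line: 191, Column: 7 }
Function:        resid_
Args:
  - String:          'vectorized loop (vectorization width: '
...
"""

for function, remarks in summarize_remarks(SAMPLE).items():
    print(function, remarks)
```

The same function should work on the decompressed remarks attachment by passing it the file contents as a string, assuming the file follows the layout above.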

I am attaching the files for a reproducer.  I can reduce them further if needed.

[mgrid_orig.ll.gz](https://github.com/llvm/llvm-project/files/11325712/mgrid_orig.ll.gz) - LLVM IR after Flang FE.

[mgrid.opt.yaml.gz](https://github.com/llvm/llvm-project/files/11325715/mgrid.opt.yaml.gz) - optimization remarks.

[mgrid_loop_accesses_dump.log.gz](https://github.com/llvm/llvm-project/files/11325716/mgrid_loop_accesses_dump.log.gz) - `-debug-only=loop-accesses` output for both copies of the `resid` function.

Clang invocation to compile `mgrid_orig.ll`:
```
clang -cc1 -triple x86_64-unknown-linux-gnu -emit-obj -disable-free -clear-ast-before-backend -main-file-name mgrid.ll -mrelocation-model pic -pic-level 2 -pic-is-pie -mframe-pointer=none -menable-no-infs -menable-no-nans -fapprox-func -funsafe-math-optimizations -fno-signed-zeros -mreassociate -freciprocal-math -fdenormal-fp-math=preserve-sign,preserve-sign -ffp-contract=fast -fno-rounding-math -ffast-math -ffinite-math-only -mconstructor-aliases -funwind-tables=2 -target-cpu icelake-server -target-feature +xsaves -target-feature +sse2 -target-feature -hreset -target-feature +avx512cd -target-feature +sha -target-feature +xsaveopt -target-feature -kl -target-feature -avxvnni -target-feature -mwaitx -target-feature -clzero -target-feature +sse4.2 -target-feature +bmi -target-feature -cldemote -target-feature -widekl -target-feature +avx512f -target-feature -raoint -target-feature +xsavec -target-feature +lzcnt -target-feature -serialize -target-feature -avxvnniint8 -target-feature +fsgsbase -target-feature +aes -target-feature +sse -target-feature -sse4a -target-feature -rdpru -target-feature -tbm -target-feature -avx512bf16 -target-feature -rtm -target-feature +fma -target-feature -waitpkg -target-feature -amx-fp16 -target-feature +avx512ifma -target-feature -avx512vp2intersect -target-feature +popcnt -target-feature +vaes -target-feature -prefetchi -target-feature +f16c -target-feature +avx2 -target-feature +sahf -target-feature +xsave -target-feature -uintr -target-feature +fxsr -target-feature +sgx -target-feature +pconfig -target-feature -avx512er -target-feature -avx512fp16 -target-feature +gfni -target-feature +rdseed -target-feature +bmi2 -target-feature -movdir64b -target-feature +avx512vl -target-feature +pku -target-feature -xop -target-feature +avx512bw -target-feature +avx512vbmi -target-feature +prfchw -target-feature +rdpid -target-feature +sse3 -target-feature +cx16 -target-feature +vpclmulqdq -target-feature +avx512vbmi2 -target-feature -enqcmd -target-feature -amx-bf16 -target-feature +64bit -target-feature -amx-int8 -target-feature -avx512pf -target-feature -ptwrite -target-feature -amx-tile -target-feature -lwp -target-feature +avx512vpopcntdq -target-feature +avx512dq -target-feature -avxneconvert -target-feature +mmx -target-feature -fma4 -target-feature +avx512vnni -target-feature -avxifma -target-feature +avx -target-feature +cmov -target-feature +sse4.1 -target-feature +movbe -target-feature +invpcid -target-feature +adx -target-feature +clwb -target-feature -prefetchwt1 -target-feature -cmpccxadd -target-feature +ssse3 -target-feature +cx8 -target-feature +clflushopt -target-feature -tsxldtrk -target-feature +pclmul -target-feature +crc32 -target-feature +rdrnd -target-feature +avx512bitalg -target-feature -shstk -target-feature -movdiri -target-feature +wbnoinvd -debugger-tuning=gdb -v -Ofast -ferror-limit 19 -fopenmp -fgnuc-version=4.2.1 -fcolor-diagnostics -vectorize-loops -vectorize-slp -faddrsig -D__GCC_HAVE_DWARF2_CFI_ASM=1 -o mgrid.o -x ir mgrid_orig.ll
```

</pre>