<table border="1" cellspacing="0" cellpadding="8">
<tr>
<th>Issue</th>
<td>
<a href=https://github.com/llvm/llvm-project/issues/152639>152639</a>
</td>
</tr>
<tr>
<th>Summary</th>
<td>
[Flang][OpenMP] Heap allocation for private variables prevents SROA optimization
</td>
</tr>
<tr>
<th>Labels</th>
<td>
llvm:optimizations,
flang,
missed-optimization,
flang:openmp
</td>
</tr>
<tr>
<th>Assignees</th>
<td>
</td>
</tr>
<tr>
<th>Reporter</th>
<td>
yus3710-fj
</td>
</tr>
</table>
<pre>
I found that 535.weather_t in SPEChpc 2021 experiences an 8.9% slowdown on NVIDIA Grace after #125732. This is because one of the loops is no longer vectorized. If OpenMP is disabled, the loop is still vectorized because `memcpy` is not called within the loop. (In addition, with OpenMP enabled, passing the `-mllvm -disable-loop-idiom-memcpy` flag also helps vectorization.)
Reproducer: https://godbolt.org/z/5TYohj1d6
Before the patch, LoopIdiomRecognize first replaced the array assignment to `stencil` with `memcpy`, and SROA then removed both `stencil` and the `memcpy`. `stencil` can be removed here because it is just an alias of `in3d(i-2:i+1, k)`. After the patch, the SROA optimization no longer applies, and the `memcpy` remains in the loop. It seems that the private copy of `stencil` is now allocated on the heap rather than on the stack, which prevents the optimization.
The following is the LLVM IR before SROA.
* Before the patch
```llvm
define internal void @_QMreproPsub..omp_par(ptr noalias readnone captures(none) %tid.addr, ptr noalias readnone captures(none) %zero.addr, ptr readonly captures(none) %0) #0 {
omp.par.entry:
%loadgep_.reloaded = load ptr, ptr %0, align 8
%gep_.reloaded12 = getelementptr i8, ptr %0, i64 8
%loadgep_.reloaded12 = load ptr, ptr %gep_.reloaded12, align 8
%gep_ = getelementptr i8, ptr %0, i64 16
%loadgep_ = load ptr, ptr %gep_, align 8
%gep_1 = getelementptr i8, ptr %0, i64 24
%loadgep_2 = load ptr, ptr %gep_1, align 8
%p.lastiter = alloca i32, align 4
%p.lowerbound = alloca i32, align 4
%p.upperbound = alloca i32, align 4
%p.stride = alloca i32, align 4
%1 = load i64, ptr %loadgep_.reloaded, align 8
%2 = load i64, ptr %loadgep_.reloaded12, align 8
%3 = alloca [4 x double], align 8
%4 = alloca [4 x double], align 8 ;; stencil(private)
...
omp.wsloop.region6.preheader: ; preds = %omp_loop.body
%scevgep = getelementptr i8, ptr %loadgep_2, i64 %25
call void @llvm.memcpy.p0.p0.i64(ptr align 8 %4, ptr align 8 %scevgep, i64 32, i1 false), !tbaa !9 ;; stencil(:) = in3d(i-2:i+1, k)
br label %omp.wsloop.region8
```
* After the patch
```llvm
define internal void @_QMreproPsub..omp_par(ptr noalias readnone captures(none) %tid.addr, ptr noalias readnone captures(none) %zero.addr, ptr readonly captures(none) %0) #2 {
omp.par.entry:
%loadgep_.reloaded = load ptr, ptr %0, align 8
%gep_.reloaded23 = getelementptr i8, ptr %0, i64 8
%loadgep_.reloaded23 = load ptr, ptr %gep_.reloaded23, align 8
%gep_3 = getelementptr i8, ptr %0, i64 32
%loadgep_4 = load ptr, ptr %gep_3, align 8
%gep_5 = getelementptr i8, ptr %0, i64 40
%loadgep_6 = load ptr, ptr %gep_5, align 8
%p.lastiter = alloca i32, align 4
%p.lowerbound = alloca i32, align 4
%p.upperbound = alloca i32, align 4
%p.stride = alloca i32, align 4
%1 = load i64, ptr %loadgep_.reloaded, align 8
%2 = load i64, ptr %loadgep_.reloaded23, align 8
%3 = load i32, ptr @_QMreproEnx, align 4, !tbaa !3
%4 = tail call dereferenceable_or_null(32) ptr @malloc(i64 32) ;; stencil(private)
...
omp.wsloop.region10.preheader: ; preds = %omp_loop.body
%scevgep = getelementptr i8, ptr %loadgep_6, i64 %25
call void @llvm.memcpy.p0.p0.i64(ptr align 8 %4, ptr align 8 %scevgep, i64 32, i1 false), !tbaa !9 ;; stencil(:) = in3d(i-2:i+1, k)
br label %omp.wsloop.region12
```
SROA then produces the following IR.
* Before the patch
```llvm
;; only loads in3d(i-2:i+1, k)
omp.wsloop.region6.preheader: ; preds = %omp_loop.body
%scevgep = getelementptr i8, ptr %loadgep_2, i64 %17
%.sroa.0.0.copyload = load double, ptr %scevgep, align 8, !tbaa !9
%.sroa.8.0.scevgep.sroa_idx = getelementptr inbounds i8, ptr %scevgep, i64 8
%.sroa.8.0.copyload = load double, ptr %.sroa.8.0.scevgep.sroa_idx, align 8, !tbaa !9
%.sroa.12.0.scevgep.sroa_idx = getelementptr inbounds i8, ptr %scevgep, i64 16
%.sroa.12.0.copyload = load double, ptr %.sroa.12.0.scevgep.sroa_idx, align 8, !tbaa !9
%.sroa.16.0.scevgep.sroa_idx = getelementptr inbounds i8, ptr %scevgep, i64 24
%.sroa.16.0.copyload = load double, ptr %.sroa.16.0.scevgep.sroa_idx, align 8, !tbaa !9
br label %omp.wsloop.region8
```
* After the patch
```llvm
;; no difference
omp.wsloop.region10.preheader: ; preds = %omp_loop.body
%scevgep = getelementptr i8, ptr %loadgep_6, i64 %25
call void @llvm.memcpy.p0.p0.i64(ptr align 8 %4, ptr align 8 %scevgep, i64 32, i1 false), !tbaa !9 ;; stencil(:) = in3d(i-2:i+1, k)
br label %omp.wsloop.region12
```
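The difference can also be seen in isolation with a reduced sketch. The module below is hypothetical and is not taken from the reproducer: the function names, the `%src` argument, and the file name in the `opt` command are assumptions made for illustration only. It contrasts a 32-byte temporary that lives in an `alloca` with the same temporary obtained from `malloc`.
```llvm
; Hypothetical reduced example, not taken from the reproducer.
; Try: opt -passes=sroa -S sketch.ll   (sketch.ll is an assumed file name)
declare void @llvm.memcpy.p0.p0.i64(ptr, ptr, i64, i1)
declare ptr @malloc(i64)
declare void @free(ptr)

; Stack case: the temporary is an alloca, so SROA can split it into scalars
; and rewrite the memcpy into direct loads from %src.
define double @first_plus_last_stack(ptr %src) {
entry:
  %tmp = alloca [4 x double], align 8
  call void @llvm.memcpy.p0.p0.i64(ptr align 8 %tmp, ptr align 8 %src, i64 32, i1 false)
  %last.addr = getelementptr inbounds i8, ptr %tmp, i64 24
  %first = load double, ptr %tmp, align 8
  %last = load double, ptr %last.addr, align 8
  %sum = fadd double %first, %last
  ret double %sum
}

; Heap case: the temporary comes from malloc. SROA only rewrites allocas,
; so the memcpy into %tmp is expected to survive the pass.
define double @first_plus_last_heap(ptr %src) {
entry:
  %tmp = call ptr @malloc(i64 32)
  call void @llvm.memcpy.p0.p0.i64(ptr align 8 %tmp, ptr align 8 %src, i64 32, i1 false)
  %last.addr = getelementptr inbounds i8, ptr %tmp, i64 24
  %first = load double, ptr %tmp, align 8
  %last = load double, ptr %last.addr, align 8
  %sum = fadd double %first, %last
  call void @free(ptr %tmp)
  ret double %sum
}
```
Running only SROA keeps the comparison focused on that pass. Other passes in a full pipeline may still simplify the heap case in a function this small, so this is only a sketch of the mechanism and not a replacement for the reproducer above.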
</pre>