<table border="1" cellspacing="0" cellpadding="8">
<tr>
<th>Issue</th>
<td>
<a href="https://github.com/llvm/llvm-project/issues/63176">63176</a>
</td>
</tr>
<tr>
<th>Summary</th>
<td>
Fortran Array Descriptors in Tight Loop (flang-new)
</td>
</tr>
<tr>
<th>Labels</th>
<td>
</td>
</tr>
<tr>
<th>Assignees</th>
<td>
</td>
</tr>
<tr>
<th>Reporter</th>
<td>
AntonRydahl
</td>
</tr>
</table>
<pre>
I have come across a small OpenMP example in which the parallel code, compiled at optimization level `-O3`, contains many store instructions in the tight loop. The issue is related to the array descriptors. Flang generates well-optimized code for the example in the left column; adding a second access to the allocatable array completely alters the IR. I compiled both programs with `flang-new -fopenmp -O3 -emit-llvm -S`.
<table>
<tr>
<td> Expected Behavior </td> <td> Suspected Bug </td>
</tr>
<tr>
<td>
```fortran
PROGRAM array_descriptor
!-----------------------------------------------------------------------
! Minimal reproducible example of flang-new not generating duplicate
! array descriptors with -fopenmp -O3
!-----------------------------------------------------------------------
IMPLICIT NONE
INTEGER(kind=4) :: length, i
REAL(kind=8), allocatable :: arr(:)
length = 1024*1024
allocate (arr(length))
!$omp parallel do
do i=1,length
arr(i) = 1.0/length
end do
!$omp end parallel do
write(*,100) "The result of (arr(1)+arr(",length,") is ", (arr(1)+arr(length))
100 format (A,I7,A,e13.6e2)
deallocate(arr)
END PROGRAM array_descriptor
```
</td>
<td>
```fortran
PROGRAM duplicate_array_descriptors
!-----------------------------------------------------------------------
! Minimal reproducible example of flang-new generating duplicate array
! descriptors with -fopenmp -O3
!-----------------------------------------------------------------------
IMPLICIT NONE
INTEGER(kind=4) :: length, i
REAL(kind=8), allocatable :: arr(:)
REAL(kind=8) :: tmp
length = 1024*1024
allocate (arr(length))
!$omp parallel do private(tmp)
do i=1,length
arr(i) = 1.0/length
tmp = arr(i)
end do
!$omp end parallel do
write(*,100) "The result of (arr(1)+arr(",length,") is ", (arr(1)+arr(length))
100 format (A,I7,A,e13.6e2)
deallocate(arr)
END PROGRAM duplicate_array_descriptors
```
</td>
</tr>
</table>
## LLVM IR at `-O0`
Even with optimizations disabled, the LLVM IR differs in ways that I do not see how the relatively small change in the program can explain.
The IR generated from the program in the left column contains the following array descriptor in the tight loop:
```llvm
omp.wsloop.region: ; preds = %omp_loop.body
...
%22 = load { ptr, i64, i32, i8, i8, i8, i8, [1 x [3 x i64]] }, ptr @_QFEarr, align 8
store { ptr, i64, i32, i8, i8, i8, i8, [1 x [3 x i64]] } %22, ptr %loadgep_2, align 8
...
```
The example in the right column contains two array descriptors:
```llvm
omp.wsloop.region: ; preds = %omp_loop.body
...
%23 = load { ptr, i64, i32, i8, i8, i8, i8, [1 x [3 x i64]] }, ptr @_QFEarr, align 8
store { ptr, i64, i32, i8, i8, i8, i8, [1 x [3 x i64]] } %23, ptr %loadgep_2, align 8
...
%36 = load { ptr, i64, i32, i8, i8, i8, i8, [1 x [3 x i64]] }, ptr @_QFEarr, align 8
store { ptr, i64, i32, i8, i8, i8, i8, [1 x [3 x i64]] } %36, ptr %loadgep_4, align 8
...
```
## LLVM IR at `-O3`
At optimization level three, the array descriptor is completely eliminated from the tight loop of the program in the left column:
```llvm
vector.body: ; preds = %vector.body, %vector.ph
%index = phi i32 [ 0, %vector.ph ], [ %index.next, %vector.body ]
%22 = or i32 %index, 1
%23 = add i32 %22, %3
%24 = sext i32 %23 to i64
%25 = add nsw i64 %24, -1
%26 = getelementptr double, ptr %.unpack, i64 %25
store <2 x double> %21, ptr %26, align 8, !tbaa !4, !alias.scope !13
%index.next = add nuw i32 %index, 2
%27 = icmp eq i32 %index.next, %n.vec
br i1 %27, label %middle.block, label %vector.body, !llvm.loop !15
```
For the program in the right column, the optimizers do not eliminate the array descriptors, which results in very inefficient code.
```llvm
omp_loop.body: ; preds = %omp_loop.body.lr.ph, %omp_loop.body
%omp_loop.iv66 = phi i32 [ 0, %omp_loop.body.lr.ph ], [ %5, %omp_loop.body ]
%5 = add nuw i32 %omp_loop.iv66, 1
%6 = add i32 %5, %3
%7 = load i32, ptr %loadgep_, align 4, !tbaa !4
%8 = sitofp i32 %7 to float
%9 = fdiv contract float 1.000000e+00, %8
%10 = fpext float %9 to double
store ptr %.unpack, ptr %loadgep_2, align 8, !tbaa !8
store i64 8, ptr %loadgep_2.repack17, align 8, !tbaa !8
store i32 20180515, ptr %loadgep_2.repack19, align 8, !tbaa !8
store i8 1, ptr %loadgep_2.repack21, align 4, !tbaa !8
store i8 28, ptr %loadgep_2.repack23, align 1, !tbaa !8
store i8 2, ptr %loadgep_2.repack25, align 2, !tbaa !8
store i8 0, ptr %loadgep_2.repack27, align 1, !tbaa !8
store i64 1, ptr %loadgep_2.repack29, align 8, !tbaa !8
store i64 %.unpack12.unpack.unpack14, ptr %loadgep_2.repack29.repack31, align 8, !tbaa !8
store i64 8, ptr %loadgep_2.repack29.repack33, align 8, !tbaa !8
%11 = sext i32 %6 to i64
%12 = add nsw i64 %11, -1
%13 = getelementptr double, ptr %.unpack, i64 %12
store double %10, ptr %13, align 8, !tbaa !4
store ptr %.unpack, ptr %loadgep_4, align 8, !tbaa !8
store i64 8, ptr %loadgep_4.repack47, align 8, !tbaa !8
store i32 20180515, ptr %loadgep_4.repack49, align 8, !tbaa !8
store i8 1, ptr %loadgep_4.repack51, align 4, !tbaa !8
store i8 28, ptr %loadgep_4.repack53, align 1, !tbaa !8
store i8 2, ptr %loadgep_4.repack55, align 2, !tbaa !8
store i8 0, ptr %loadgep_4.repack57, align 1, !tbaa !8
store i64 1, ptr %loadgep_4.repack59, align 8, !tbaa !8
store i64 %.unpack12.unpack.unpack14, ptr %loadgep_4.repack59.repack61, align 8, !tbaa !8
store i64 8, ptr %loadgep_4.repack59.repack63, align 8, !tbaa !8
%exitcond.not = icmp eq i32 %omp_loop.iv66, %reass.sub
br i1 %exitcond.not, label %omp_loop.exit, label %omp_loop.body
```
At first glance it looked as though the many store instructions in the example above were introduced by `InstCombinePass`, but that pass is merely simplifying the aggregate `store { ptr, i64, i32, i8, i8, i8, i8, [1 x [3 x i64]] }` instructions that were already present.
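A minimal sketch of that simplification (value names are made up for illustration) looks like:

```llvm
; Before: one aggregate store of the whole descriptor.
store { ptr, i64, i32, i8, i8, i8, i8, [1 x [3 x i64]] } %desc, ptr %slot, align 8

; After: one scalar store per field, matching the -O3 IR above.
store ptr %base, ptr %slot, align 8
store i64 8, ptr %slot.elem_len, align 8
store i32 20180515, ptr %slot.version, align 8
; ... and so on for the remaining i8 fields and the dimension triple.
```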
I am unsure whether this is a frontend or an optimizer issue, but the fact that the unoptimized intermediate representation is almost identical for the two examples points towards the latter.
</pre>
KScTOCdhds7B6QgFZ91mcqzhZ2W4S_TJINE7mHkgMrdqUzfmZo7JG6GY7QmGA86m4HvfUWiW2yDjDj3-g4Rex03s5j3VJA66taNN0PGAVjVMPSHxkO3PbbSnrjVmA5KrEvMxgEijw05mLyM1UCkFGifzeJpMn0NcvB5xDskzSKFsjS_gEIk-52boVgJU8jLUc0jTDom-iBQ_hzR7_ZyyyfORekPMw7ZxTK2EHn8015NnrRx_pMmbzD27MC1m-jKmI1Iy2Haz4a6b0LFdN0kGu26S_tium9BTN4NKYHpPMXnGqckbKT95d8wnx1BPfhrnW8R3c75Bmr6f8y3UuznfIr2b8y3S-znfQv1SzndWjj-y93N-iPk6zuMjt7mSReRarZFGctBJEDrVyIyJzG496Cf7aCcNZAvjJMZHuhbqrI1cWthwbSxsBZO5rx3cH7K-YwGCfw-NYe8NLJfG6l1-8g6WrdW-PYsZyFmFoRElWbySxt6oas0lfmHGOMP0BtY72-9IwwmxZsa4I-AfO2PB8KoWfPPE5dZLkiz--edgksUnDp08nltBeALnSqR1Pu4sHEq0JWqwJfdTZc5PaVEW7nTW9tft40H_NrpxeOXU5Pfw5M71YP4po7vaye7NM5cWdYUFd125xlqjQWlDhJxFUSljgRcoLc-ZaF9S24PqlqBWXFoDVh2YLsKzQsGsRR1dFFdpsUgX7AKvkmyeJbN0MY8vyqsM58Ua6SZjmCYJYp7l6YTGeco2dDqbzy74FY1pGmfxjCbpYhJHcT6fr2eTWTxnKRb5hExirBgXkT-_KL298O5fZWkyyy58UprmPxjoKyd0ud5tDZnEghtrOjXLrcCr-_AWCZb-VHLbO5VwCV_98eZDOCPN29c4hC4udlpcldbW_vkYvSf0fsttuVtHuaoIvfeHk_B1WWv1B-aW0Hs_U0PovZ_s_wMAAP__WNl_gw">