<table border="1" cellspacing="0" cellpadding="8">
<tr>
<th>Issue</th>
<td>
<a href="https://github.com/llvm/llvm-project/issues/63176">63176</a>
</td>
</tr>
<tr>
<th>Summary</th>
<td>
Fortran Array Descriptors in Tight Loop (flang-new)
</td>
</tr>
<tr>
<th>Labels</th>
<td>
</td>
</tr>
<tr>
<th>Assignees</th>
<td>
</td>
</tr>
<tr>
<th>Reporter</th>
<td>
AntonRydahl
</td>
</tr>
</table>
<pre>
I have come across a small OpenMP example in which the parallel code, compiled at optimization level `-O3`, contains many store instructions in the tight loop. The issue is related to the array descriptors. Flang generates well-optimized code for the example in the left column; adding a second access to the allocatable array completely alters the IR. I compiled both programs with `flang-new -fopenmp -O3 -emit-llvm -S`.
<table>
<tr>
<td> Expected Behavior </td> <td> Suspected Bug </td>
</tr>
<tr>
<td>
```fortran
PROGRAM array_descriptor
!-----------------------------------------------------------------------
! Minimal reproducible example of flang-new not generating duplicate
! array descriptors with -fopenmp -O3
!-----------------------------------------------------------------------
IMPLICIT NONE
INTEGER(kind=4) :: length, i
REAL(kind=8), allocatable :: arr(:)
length = 1024*1024
allocate (arr(length))
!$omp parallel do
do i=1,length
arr(i) = 1.0/length
end do
!$omp end parallel do
write(*,100) "The result of (arr(1)+arr(",length,") is ", (arr(1)+arr(length))
100 format (A,I7,A,e13.6e2)
deallocate(arr)
END PROGRAM array_descriptor
```
</td>
<td>
```fortran
PROGRAM duplicate_array_descriptors
!-----------------------------------------------------------------------
! Minimal reproducible example of flang-new generating duplicate array
! descriptors with -fopenmp -O3
!-----------------------------------------------------------------------
IMPLICIT NONE
INTEGER(kind=4) :: length, i
REAL(kind=8), allocatable :: arr(:)
REAL(kind=8) :: tmp
length = 1024*1024
allocate (arr(length))
!$omp parallel do private(tmp)
do i=1,length
arr(i) = 1.0/length
tmp = arr(i)
end do
!$omp end parallel do
write(*,100) "The result of (arr(1)+arr(",length,") is ", (arr(1)+arr(length))
100 format (A,I7,A,e13.6e2)
deallocate(arr)
END PROGRAM duplicate_array_descriptors
```
</td>
</tr>
</table>
## LLVM IR at `-O0`
Even with optimizations disabled, the LLVM IR differs in ways that I do not see how the relatively small change in the program can explain.
The IR generated from the program in the left column contains the following array descriptor in the tight loop:
```llvm
omp.wsloop.region: ; preds = %omp_loop.body
...
%22 = load { ptr, i64, i32, i8, i8, i8, i8, [1 x [3 x i64]] }, ptr @_QFEarr, align 8
store { ptr, i64, i32, i8, i8, i8, i8, [1 x [3 x i64]] } %22, ptr %loadgep_2, align 8
...
```
The example in the right column contains two array descriptors:
```llvm
omp.wsloop.region: ; preds = %omp_loop.body
...
%23 = load { ptr, i64, i32, i8, i8, i8, i8, [1 x [3 x i64]] }, ptr @_QFEarr, align 8
store { ptr, i64, i32, i8, i8, i8, i8, [1 x [3 x i64]] } %23, ptr %loadgep_2, align 8
...
%36 = load { ptr, i64, i32, i8, i8, i8, i8, [1 x [3 x i64]] }, ptr @_QFEarr, align 8
store { ptr, i64, i32, i8, i8, i8, i8, [1 x [3 x i64]] } %36, ptr %loadgep_4, align 8
...
```
## LLVM IR at `-O3`
At optimization level three, the array descriptor is completely eliminated from the tight loop of the program in the left column:
```llvm
vector.body: ; preds = %vector.body, %vector.ph
%index = phi i32 [ 0, %vector.ph ], [ %index.next, %vector.body ]
%22 = or i32 %index, 1
%23 = add i32 %22, %3
%24 = sext i32 %23 to i64
%25 = add nsw i64 %24, -1
%26 = getelementptr double, ptr %.unpack, i64 %25
store <2 x double> %21, ptr %26, align 8, !tbaa !4, !alias.scope !13
%index.next = add nuw i32 %index, 2
%27 = icmp eq i32 %index.next, %n.vec
br i1 %27, label %middle.block, label %vector.body, !llvm.loop !15
```
For the program in the right column, the optimizers do not eliminate the array descriptors, which results in very inefficient code.
```llvm
omp_loop.body: ; preds = %omp_loop.body.lr.ph, %omp_loop.body
%omp_loop.iv66 = phi i32 [ 0, %omp_loop.body.lr.ph ], [ %5, %omp_loop.body ]
%5 = add nuw i32 %omp_loop.iv66, 1
%6 = add i32 %5, %3
%7 = load i32, ptr %loadgep_, align 4, !tbaa !4
%8 = sitofp i32 %7 to float
%9 = fdiv contract float 1.000000e+00, %8
%10 = fpext float %9 to double
store ptr %.unpack, ptr %loadgep_2, align 8, !tbaa !8
store i64 8, ptr %loadgep_2.repack17, align 8, !tbaa !8
store i32 20180515, ptr %loadgep_2.repack19, align 8, !tbaa !8
store i8 1, ptr %loadgep_2.repack21, align 4, !tbaa !8
store i8 28, ptr %loadgep_2.repack23, align 1, !tbaa !8
store i8 2, ptr %loadgep_2.repack25, align 2, !tbaa !8
store i8 0, ptr %loadgep_2.repack27, align 1, !tbaa !8
store i64 1, ptr %loadgep_2.repack29, align 8, !tbaa !8
store i64 %.unpack12.unpack.unpack14, ptr %loadgep_2.repack29.repack31, align 8, !tbaa !8
store i64 8, ptr %loadgep_2.repack29.repack33, align 8, !tbaa !8
%11 = sext i32 %6 to i64
%12 = add nsw i64 %11, -1
%13 = getelementptr double, ptr %.unpack, i64 %12
store double %10, ptr %13, align 8, !tbaa !4
store ptr %.unpack, ptr %loadgep_4, align 8, !tbaa !8
store i64 8, ptr %loadgep_4.repack47, align 8, !tbaa !8
store i32 20180515, ptr %loadgep_4.repack49, align 8, !tbaa !8
store i8 1, ptr %loadgep_4.repack51, align 4, !tbaa !8
store i8 28, ptr %loadgep_4.repack53, align 1, !tbaa !8
store i8 2, ptr %loadgep_4.repack55, align 2, !tbaa !8
store i8 0, ptr %loadgep_4.repack57, align 1, !tbaa !8
store i64 1, ptr %loadgep_4.repack59, align 8, !tbaa !8
store i64 %.unpack12.unpack.unpack14, ptr %loadgep_4.repack59.repack61, align 8, !tbaa !8
store i64 8, ptr %loadgep_4.repack59.repack63, align 8, !tbaa !8
%exitcond.not = icmp eq i32 %omp_loop.iv66, %reass.sub
br i1 %exitcond.not, label %omp_loop.exit, label %omp_loop.body
```
At first glance it looked as though the many store instructions in the example above were introduced by `InstCombinePass`, but that pass is merely simplifying the aggregate `store { ptr, i64, i32, i8, i8, i8, i8, [1 x [3 x i64]] }` instructions that were already present.
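A minimal sketch of that simplification (value names are made up for illustration) looks like:

```llvm
; Before: one aggregate store of the whole descriptor.
store { ptr, i64, i32, i8, i8, i8, i8, [1 x [3 x i64]] } %desc, ptr %slot, align 8

; After: one scalar store per field, matching the -O3 IR above.
store ptr %base, ptr %slot, align 8
store i64 8, ptr %slot.elem_len, align 8
store i32 20180515, ptr %slot.version, align 8
; ... and so on for the remaining i8 fields and the dimension triple.
```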
I am unsure whether this is a frontend or an optimizer issue, but the fact that the unoptimized intermediate representation is almost identical for the two examples points towards the latter.
</pre>
KScTOCdhds7B6QgFZ91mcqzhZ2W4S_TJINE7mHkgMrdqUzfmZo7JG6GY7QmGA86m4HvfUWiW2yDjDj3-g4Rex03s5j3VJA66taNN0PGAVjVMPSHxkO3PbbSnrjVmA5KrEvMxgEijw05mLyM1UCkFGifzeJpMn0NcvB5xDskzSKFsjS_gEIk-52boVgJU8jLUc0jTDom-iBQ_hzR7_ZyyyfORekPMw7ZxTK2EHn8015NnrRx_pMmbzD27MC1m-jKmI1Iy2Haz4a6b0LFdN0kGu26S_tium9BTN4NKYHpPMXnGqckbKT95d8wnx1BPfhrnW8R3c75Bmr6f8y3UuznfIr2b8y3S-znfQv1SzndWjj-y93N-iPk6zuMjt7mSReRarZFGctBJEDrVyIyJzG496Cf7aCcNZAvjJMZHuhbqrI1cWthwbSxsBZO5rx3cH7K-YwGCfw-NYe8NLJfG6l1-8g6WrdW-PYsZyFmFoRElWbySxt6oas0lfmHGOMP0BtY72-9IwwmxZsa4I-AfO2PB8KoWfPPE5dZLkiz--edgksUnDp08nltBeALnSqR1Pu4sHEq0JWqwJfdTZc5PaVEW7nTW9tft40H_NrpxeOXU5Pfw5M71YP4po7vaye7NM5cWdYUFd125xlqjQWlDhJxFUSljgRcoLc-ZaF9S24PqlqBWXFoDVh2YLsKzQsGsRR1dFFdpsUgX7AKvkmyeJbN0MY8vyqsM58Ua6SZjmCYJYp7l6YTGeco2dDqbzy74FY1pGmfxjCbpYhJHcT6fr2eTWTxnKRb5hExirBgXkT-_KL298O5fZWkyyy58UprmPxjoKyd0ud5tDZnEghtrOjXLrcCr-_AWCZb-VHLbO5VwCV_98eZDOCPN29c4hC4udlpcldbW_vkYvSf0fsttuVtHuaoIvfeHk_B1WWv1B-aW0Hs_U0PovZ_s_wMAAP__WNl_gw">