[flang-commits] [flang] [RFC][flang] Add support for assumed-shape dummy arrays repacking. (PR #127147)

Wed Feb 19 03:41:23 PST 2025

================
@@ -0,0 +1,498 @@
+<!--===- docs/ArrayRepacking.md
+
+   Part of the LLVM Project, under the Apache License v2.0 with LLVM Exceptions.
+   See https://llvm.org/LICENSE.txt for license information.
+   SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception
+
+-->
+
+# Assumed-shape arrays repacking
+Fortran 90 allowed dummy arguments to be declared as assumed-shape arrays, which made it possible to pass non-contiguous arrays to subprograms. In some cases, accessing non-contiguous arrays may result in poor program performance, and paying the overhead of copying a non-contiguous array into contiguous memory (packing) before processing it may result in better performance. This document describes Flang compiler and runtime support for packing/unpacking of non-contiguous arrays.
+
+## A problem case
+
+[Example #1](#example-1) provides a way to compare performance of a repetitive access of a large array when the array is contiguous and non-contiguous. The `test` function remains the same in both cases to make sure that any difference in the code generation does not affect performance, and only the array layout in memory matters.
+
+The example might be compiled using any Fortran 90 compiler, e.g. `gfortran -cpp example1.f90 -O2 <additional-options>`. The table below contains performance information for different compilations and targets:
+
+| additional-options | AMD EPYC 9684X, GNU Fortran 13.2.0                           | Arm Neoverse V2, GNU Fortran 11.4.0                          |
+| ------------------ | ------------------------------------------------------------ | ------------------------------------------------------------ |
+| `none`             | 20495466758      L1-dcache-loads<br/>            31403868      L1-dcache-prefetches<br/>     10116173649      L1-dcache-load-misses<br/>119,167,236,596      cycles | 20030549910      L1-dcache-loads<br/>   10233598442      L1-dcache-load-misses<br/>     1098681496      LLC-load-misses<br/>43,426,056,799      cycles |
+| `-DREPACKING`      | 20245847735      L1-dcache-loads<br/>         583614040      L1-dcache-prefetches<br/>         644552282      L1-dcache-load-misses<br/>  10,837,843,298      cycles | 20023110457      L1-dcache-loads<br/>              294393      L1-dcache-load-misses<br/>              321878      LLC-load-misses<br/>10,065,421,618      cycles |
+| `-frepack-arrays`  | 20248624699      L1-dcache-loads<br/>         584325700      L1-dcache-prefetches<br/>         644353154      L1-dcache-load-misses<br/>  10,850,830,504      cycles | 20023117997      L1-dcache-loads<br/>              275169      L1-dcache-load-misses<br/>              323902      LLC-load-misses<br/>10,066,689,166      cycles |
+
+The default version is much slower than the version with manual array repacking and [-frepack-arrays](https://gcc.gnu.org/onlinedocs/gfortran/Code-Gen-Options.html#index-frepack-arrays) due to the L1 data cache misses even considering the extra overhead required to pack/unpack the non-contiguous array.
+
+This artificial example was inspired by the innermost hot loop from the `fourir` subroutine of the [Polyhedron/capacita](https://fortran.uk/fortran-compiler-comparisons/the-polyhedron-solutions-benchmark-suite/) benchmark, which speeds up by about 1.8x with the GNU Fortran compiler's `-frepack-arrays` option on AMD EPYC 9684X and by about 1.3x on Arm Neoverse V2.
+
+Given these results, it seems reasonable to provide support for array repacking in the Flang compiler, which may reduce the amount of effort needed to rewrite existing Fortran programs for better data cache utilization.
+
+## Implementations in other compilers
+
+### GNU Fortran compiler
+
+The `-frepack-arrays` option of the GNU Fortran compiler lets the compiler generate special subprogram prologue/epilogue code that performs automatic packing/unpacking of the assumed-shape dummy arrays. With some implementation limitations, the following happens for any such dummy array:
+
+**In subprogram prologue:** iff the array is not contiguous in any dimension, it is copied into a newly allocated contiguous chunk of memory, and the following subprogram code operates on the temporary. This is a `pack` action consisting of the allocation and copy-in.
+
+**In subprogram epilogue:** iff the array is not contiguous in any dimension, values from the temporary array are copied over to the original array and the temporary array is deallocated. This is an `unpack` action consisting of the copy-out and deallocation. It makes sure any updates of the array done by the subprogram are propagated to the caller side.
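+
+The `pack`/`unpack` actions described above can be sketched by hand in Fortran. This is only an illustration under stated assumptions, not what `gfortran` actually emits: the program and subroutine names (`pack_unpack_sketch`, `scale_by_two`) and the loop body are hypothetical, and the contiguity check uses the Fortran 2008 `is_contiguous` intrinsic instead of a runtime library call.
+
+```Fortran
+program pack_unpack_sketch
+  implicit none
+  real :: a(4, 3)
+  a = 1.0
+  ! a(1, :) is a non-contiguous section: its elements are 4 reals apart.
+  call scale_by_two(a(1, :))
+  print *, a(1, :)
+contains
+  ! Hypothetical subprogram with hand-written prologue/epilogue repacking.
+  subroutine scale_by_two(x)
+    real, intent(inout), target :: x(:)
+    real, allocatable, target :: tmp(:)
+    real, pointer :: p(:)
+    logical :: repacked
+
+    ! Prologue (pack): allocate a temporary and copy in,
+    ! but only if the dummy array is not contiguous.
+    repacked = .not. is_contiguous(x)
+    if (repacked) then
+      allocate(tmp(size(x)))
+      tmp = x
+      p => tmp
+    else
+      p => x
+    end if
+
+    ! Subprogram body operates on the (possibly temporary) array.
+    p = p * 2.0
+
+    ! Epilogue (unpack): copy out and deallocate the temporary.
+    if (repacked) then
+      x = tmp
+      deallocate(tmp)
+    end if
+  end subroutine scale_by_two
+end program pack_unpack_sketch
+```
+
+Because the check is dynamic, a contiguous actual argument takes the `p => x` path and pays only for the check itself.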
+
+#### Facts and guesses about the implementation
+
+The dynamic checks for continuity and the array copy code are located completely in the [runtime](https://github.com/gcc-mirror/gcc/blob/3e08a4ecea27c54fda90e8f58641b1986ad957e1/libgfortran/generated/in_pack_r8.c#L35), so the compiler inserts unconditional calls in the subprogram prologue/epilogue.
+
+It looks like `gfortran` ignores `intent(out)/intent(in)` which could have helped to avoid some of the `pack/unpack` overhead.
+
+It looks like the `pack`/`unpack` actions are inserted early in the compilation pipeline, and these extra calls affect the behavior of the later optimization passes. For example, `Polyhedron/fatigue2` slows down by about 2x with `-frepack-arrays`: this slowdown is not caused by the `pack`/`unpack` overhead, but is a consequence of worse function inlining decisions made after the calls are inserted. The benchmark becomes even faster than the original version with `-frepack-arrays` and proper `-finline-limit=` settings, but it does not look like the benchmark contains code that would benefit from the array repacking.
+
+It does not look like `gfortran` is able to eliminate the `pack`/`unpack` code after the function inlining, if the actual argument is statically known to be contiguous. So the overhead from the dynamic continuity checks is inevitable when `-frepack-arrays` is specified.
+
+It does not look like `gfortran` tries to optimize the insertion of `pack`/`unpack` code. For example, if a dummy array is only used under a condition within the subprogram, the repacking code might be inserted under the same condition to minimize the overhead on the unconditional path through the subprogram.
+
+### NVIDIA HPC Fortran compiler
+
+The `nvfortran` compiler performs array repacking by default and has few options to control this behavior (only [-M[no]target_temps](https://docs.nvidia.com/hpc-sdk/compilers/hpc-compilers-ref-guide/index.html#command-line-options-reference)). The compiler inserts `pack`/`unpack` code around the calls of subprograms that have assumed-shape dummy array arguments (a procedure having an assumed-shape dummy argument must have an explicit interface due to F2018 15.4.2.2, 1, (3), (b)):
+
+**Before the call:** iff the array is not contiguous in the innermost dimension, it is copied into a newly allocated contiguous chunk of memory, and the temporary array is passed to the callee.
+
+**After the call:** iff the array is not contiguous in the innermost dimension, values from the temporary array are copied to the original array and the temporary array is deallocated.
+
+#### Facts and guesses about the implementation
+
+The `pack` code is only generated if the actual argument may be non-contiguous in the innermost dimension, as determined statically, i.e. the compiler does not generate any dynamic continuity checks. For example:
+
+```Fortran
+interface
+  subroutine test1(x)
+    real :: x(:)
+  end subroutine test1
+  subroutine test2(x)
+    real :: x(:,:)
+  end subroutine test2
+end interface
+real :: x(m1,m2), y(1,m2)
+call test1(x(1,:)) ! case 1
+call test1(y(1,:)) ! case 2
+call test2(x(1:n,:)) ! case 3
+call test2(x(1:1,:)) ! case 4
+```
+
+In case 1, the `pack`/`unpack` code is generated without dynamically checking if `m1 == 1` (in which case the actual argument is actually contiguous).
+
+In case 2, the `pack`/`unpack` code is also generated, which leaves room for improvement.
+
+In cases 3 and 4, the `pack`/`unpack` code is not generated, because the actual argument is contiguous in the innermost dimension. There seems to be room for improvement in case 4, where it might be beneficial to repack the array if `m1` is large enough to hurt data cache utilization (depending on the actual processing of the array in `test2`, of course).
+
+`nvfortran` does optimize out the `unpack` copy-out code in case the dummy argument is declared `intent(in)`, but it does not optimize the `pack` copy-in in case it is declared `intent(out)`.
+
+It looks like `nvfortran` is not able to optimize the `pack`/`unpack` code after the function inlining (`-Minline=reshape`), even if the inline code makes it obvious that only a single element of the array is being accessed and there is no reason to copy-in/out the whole array.
+
+`nvfortran`'s implementation guarantees that an assumed-shape dummy array is contiguous in the innermost dimension, so when such a dummy is passed to a callee as an actual argument associated with the callee's assumed-shape dummy array, there is no need to `pack`/`unpack` it again around the callee's call site.
+
+## Known limitations of the array repacking
+
+The `gfortran` documentation, as expected, warns that the array repacking `can introduce significant overhead to the function call, especially when the passed data is noncontiguous`. A compiler has to try to minimize the overhead of the copy-in/out actions whenever possible, but it may not always be possible to guess correctly when the repacking is profitable. So `gfortran`'s approach of giving the users control over the repacking seems reasonable. A compiler may decide to enable array repacking by default or under some optimization levels, but the correctness issues described below have to be taken into account as well as performance and usability (i.e. the need to specify a compiler option to enable/disable array repacking).
+
+**Correctness issues in multithreaded and offload contexts**
+
+Array repacking creates a complete copy of an array section and lets the program code work on the temporary copy, then reflects the updates back through another copy. If the original program intends to let different threads work on different parts of the same array section, then the copy-in/out actions introduce a data race that did not exist in the original program. [Example #2](#example-2) produces inconsistent results when compiled with either `nvfortran -mp` or `gfortran -fopenmp -frepack-arrays` and run with multiple threads. Note that the `repacking` subroutine and its call site might be written such that they are located in separate modules that do not have to be compiled with `-mp/-fopenmp`, so a compiler has no clue whether array repacking is safe. Even if explicitly instructed via `-frepack-arrays`, the compiler cannot avoid false-positive warnings about unsafety of array repacking, because it cannot know whether a function might be called in a multithreaded context (e.g. when `-mp/-fopenmp` is not specified).
+
+The array copies may also become a problem for OpenACC/OpenMP target data environment management. For example:
+
+```Fortran
+subroutine test(x)
+  real :: x(:)
+  !$acc serial present(x)
+  ...
+  !$acc end serial
+end subroutine test
+subroutine caller(n)
+  integer :: n
+  real :: x(n,n)
+  !$acc enter data create(x)
+  call test(x(1,:))
+end subroutine caller
+```
+
+The whole array `x` is expected to be present in the device data environment after the `enter data` construct, but the actual array being "seen" at the `serial` construct is a temporary copy of the array section, which has no corresponding memory on the device.
+
+A compiler could generate code that dynamically detects both of these situations, i.e. whether the repacking is happening in a multithreaded context or whether the array to be repacked has associated bookkeeping in the device data environment, and does not create copies in those cases. Such checks would introduce dependencies on the parallelization/offload runtime libraries, which are not linked unless the compiler is instructed to do so via `-acc/-fopenacc/-mp/-fopenmp/etc.`
+
+So it does not seem practical/reasonable to enable the array repacking by default in a compiler that must produce correct code for all standard conformant programs. It is still beneficial to let users request array repacking, given that its behavior is properly documented and all the warning signs are in place.
+
+## Flang feature requirements
+
+### Correctness
+
+1. Support repacking of assumed-shape array dummy arguments or actual array arguments associated with such dummy arguments of any data types.
+2. When array repacking is enabled, Flang should guarantee correct program behavior when OpenACC/OpenMP features are explicitly enabled during the compilation.
+   * [TBD] not sure if it is always possible to prevent runtime issues, especially, for programs with target offload.
+
+### Performance
+
+1. Minimize the overhead of array repacking, e.g. avoid copy-in/out whenever possible, execute copy-in/out only on the execution paths where the array is accessed.
+2. Provide different modes of repacking depending on the "continuity" meaning, i.e. one - array is contiguous in the innermost dimension, two - array is contiguous in all dimensions.
+3. Avoid generating repacking code, when the "continuity" can be statically proven (including after optimization passes like constant propagation, function inlining, etc.).
+4. Use a set of heuristics to avoid generating repacking code based on the array usage pattern, e.g. if an array is proven not to be used in an array expression or a loop, etc.
+5. Use a set of heuristics to avoid repacking actions dynamically, e.g. based on the array size, element size, byte stride(s) of the [innermost] dimension(s), etc.
+6. Minimize the impact of the IR changes, introduced by repacking, on the later optimization passes.
+
+### Usability
+
+1. Provide command line options to enable/disable array repacking, e.g. `-f[no-]repack-arrays` for `gfortran` cli compatibility.
+2. Provide command line options to instruct the compiler which performance heuristics to use with the default picked based on benchmarking.
+3. Provide consistent behavior of the temporary arrays with relation to `-fstack-arrays` (that forces all temporary arrays to be allocated on the stack).
+4. Produce correct debug information to substitute the original array with the copy array when accessing values in the debugger.
+5. Document potential correctness issues that array repacking may cause in multithreaded/offload execution.
+
+## Proposed design
+
+### Overview
+
+Controlled by cli options, Lowering will generate a `fir.pack_array` operation in a subprogram's prologue for each assumed-shape dummy array argument (including `OPTIONAL`). For each `fir.pack_array` it will also generate `fir.unpack_array` in the subprogram's epilogue. These new operations will represent the complete effects of `pack`/`unpack` actions, such as temp-allocation/copy-in/copy-out/temp-deallocation. While it is possible to represent the needed actions using existing FIR/HLFIR operations, it is worth keeping them more specific and compact for easier manipulation in the passes related to optimizing the `pack`/`unpack` actions.
+
+The new operations will hold all the information that customizes further handling of the `pack`/`unpack` actions, such as:
+
+* Optional array of attributes supporting an interface to generate a predicate that says if the repacking is safe in the current context.
+* The continuity mode: `innermost` vs `whole`.
+* Attributes selecting the heuristics (both compiler and runtime ones) that may be applied to avoid `pack`/`unpack` actions.
+* Other attributes, like `stack` vs `heap` to manage the temporary allocation according to `-fstack-arrays`, etc.
+
+Lowering will not try to optimize the insertion of new operations, except for obvious cases like `CONTIGUOUS` dummy arguments or arrays of elements bigger than the element size threshold. Further optimization passes will be responsible for optimizing the operations away or moving them around to satisfy the performance requirements.
+
+The following FIR passes should be implemented:
+
+* Deletion of `fir.pack_array`/`fir.unpack_array` that are statically proven to take a contiguous input array.
+* Deletion/merging of cascaded `fir.pack_array` operations.
+* Deletion of the new operations that are statically proven not to meet the array usage patterns that are considered to benefit from the array repacking.
+* Deletion of the new operations that are statically proven not to meet the dynamic conditions for repacking (such as the array size).
+* Repositioning of `fir.pack_array`/`fir.unpack_array` to execution paths where the array is actually accessed.
+* A pass converting the operations to the existing FIR operations and/or Fortran runtime calls.
+
+### New operations to represent pack/unpack
+
+#### fir.pack_array operation
+
+The operation has the following syntax:
+
+```
+%new_var = fir.pack_array %var
+    [stack ]
+    [innermost ]
+    [no_copy ]
+    [heuristics([none|loop-only]) ]
+    [constraints([max-size = <int>, ][max-element-size = <int>, ]
+                 [min-stride = <int>]) ]
+    [typeparams %p1, ... ]
+    [<[acc.temp_copy_is_safe][omp.temp_copy_is_safe]>]
+    : !fir.box/class<!fir.array<...>>
+```
+
+The operation creates a new `!fir.box/class<!fir.array<>>` value to represent either the original `%var` or a newly allocated temporary array, maybe identical to `%var` by value.
+
+Arguments:
+
+* `stack` - indicates if `-fstack-arrays` is in effect for compiling this function.
+* `innermost` - tells that the repacking has to be done iff the array is not contiguous in the innermost dimension. This also describes what type of continuity can be expected from `%new_var`, i.e. `innermost` means that the resulting array is definitely contiguous in the innermost dimension, but may be non-contiguous in other dimensions (unless additional analysis proves otherwise). For 1-D arrays, `innermost` attribute is not valid.
+* `no_copy` - indicates that, in case a temporary array is created, `%var` to `%new_var` copy is not required (`intent(out)` dummy argument case).
+* `heuristics`
+  * `loop-only` - `fir.pack_array` can be optimized away, if the array is not used in a loop.
+  * `none` - `fir.pack_array` cannot be optimized based on the array usage pattern.
+* `constraints`
+  * `max-size` - constant integer attribute specifying the maximum byte size of an array that is eligible for repacking.
+  * `max-element-size` - constant integer attribute specifying the maximum byte element-size of an array that is eligible for repacking.
+  * `min-stride` - constant integer attribute specifying the minimum byte stride of the innermost dimension of an array that is eligible for repacking.
+* `typeparams` - type parameters of the element type.
+* `*.temp_copy_is_safe`: a list of attributes implementing `TempCopyIsSafe` attribute interface for generating a boolean value indicating whether using a temporary copy instead of the original array is safe in the current context.
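+
+One possible reading of how the `constraints` values could combine into a dynamic eligibility check is sketched below in Fortran. The module and function names are hypothetical, the inclusive comparisons are an assumption (the attribute descriptions above do not say whether the bounds are inclusive), and the real check would be produced by the compiler or the runtime rather than written by hand.
+
+```Fortran
+module repack_constraints_sketch
+  use, intrinsic :: iso_fortran_env, only: int64
+  implicit none
+contains
+  ! Hypothetical helper: returns .true. when an array satisfies all
+  ! three constraints and is therefore a candidate for repacking.
+  pure function eligible_for_repacking(array_bytes, element_bytes, &
+                                       innermost_stride_bytes, &
+                                       max_size, max_element_size, &
+                                       min_stride) result(ok)
+    integer(int64), intent(in) :: array_bytes            ! total byte size of the array
+    integer(int64), intent(in) :: element_bytes          ! byte size of one element
+    integer(int64), intent(in) :: innermost_stride_bytes ! byte stride of the innermost dimension
+    integer(int64), intent(in) :: max_size, max_element_size, min_stride
+    logical :: ok
+
+    ! Assumption: all bounds are inclusive.
+    ok = array_bytes <= max_size .and. &
+         element_bytes <= max_element_size .and. &
+         innermost_stride_bytes >= min_stride
+  end function eligible_for_repacking
+end module repack_constraints_sketch
+```
+
+Keeping the thresholds as constant attributes on the operation would let a later pass fold such a check away whenever the array size, element size or stride is statically known.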
+
+Memory effects are conservative, assuming that an allocation and copy may happen:
+
+* `MemAlloc` effect on either `AutomaticAllocationScopeResource` or `DefaultResource` depending on `stack` attribute.
+* If there is no `no_copy`:
+  * `MemRead` effect on unknown value to indicate potential read from the original array.
+    * [TBD] we can relax that by having an additional argument taking `fir.box_addr %var` value, though, this adds some redundancy to the argument list.
+  * `MemWrite` effect on unknown value to indicate potential write into the temporary array.
+  * [TBD] maybe we do not need the `MemRead`/`MemWrite` effects at all, because the temporary array is not distinguishable from the original array (at least while the repacking operations stay in the IR), so any reads/writes of the original array can be moved across `fir.pack_array`.
+
+Alias analysis:
+
+* For the purpose of alias analysis, `fir.pack_array` should be considered a pass-through operation, and its value should be treated as `MayAlias` with the original array.
----------------
tblah wrote:

I think this is too pessimistic. I think this can be modeled as a new allocation.

https://github.com/llvm/llvm-project/pull/127147