[all-commits] [llvm/llvm-project] 2b60ed: [flang] Use Assign() runtime for copy-in/copy-out.

Wed Dec 21 09:55:49 PST 2022

  Branch: refs/heads/main
  Home:   https://github.com/llvm/llvm-project
  Commit: 2b60ed405b8110b20ab2e383839759ea34003127
      https://github.com/llvm/llvm-project/commit/2b60ed405b8110b20ab2e383839759ea34003127
  Author: Slava Zakharin <szakharin at nvidia.com>
  Date:   2022-12-21 (Wed, 21 Dec 2022)

  Changed paths:
    M flang/lib/Lower/ConvertExpr.cpp
    M flang/test/Lower/call-by-value-attr.f90
    M flang/test/Lower/call-copy-in-out.f90
    M flang/test/Lower/dummy-argument-assumed-shape-optional.f90
    M flang/test/Lower/dummy-argument-optional-2.f90
    M flang/test/Lower/optional-value-caller.f90
    M flang/test/Lower/parent-component.f90

  Log Message:
  -----------
  [flang] Use Assign() runtime for copy-in/copy-out.

The loops generated under the IsContiguous check for copy-in/copy-out
cause the LLVM backend to spend too much time optimizing them.
At the same time, the copy loops offer no optimization
opportunities with the surrounding code (since they execute only
under a runtime IsContiguous check), so the copy code can be optimized
on its own, and this can be done in the runtime library.

I initially planned to implement and use new APIs for packing/unpacking
non-contiguous data (interfaces added in D136378), but then found
that Assign() already does what is needed. If performance
becomes an issue for these copies, we can optimize the code in Assign()
rather than create new APIs.

Thus, this change uses Assign() for copy-in/copy-out
of boxed objects, and only when the objects
are non-contiguous at run time. Copies for non-boxed
objects (e.g. for passing to a VALUE dummy argument) are still
done inline, because they can potentially be optimized together with
surrounding loops.
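
The caller-side sequence described above can be sketched roughly as
follows. This is a simplified C++ model, not the code flang generates and
not the flang runtime API: Box, isContiguous, assign, scaleContiguous and
callWithCopyInOut are all hypothetical names, and the model covers only
rank-1 arrays of double.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical stand-in for a boxed (descriptor-based) rank-1 array.
// Assumption: base address, element count, and stride counted in elements.
struct Box {
  double *base;
  std::size_t extent;
  std::ptrdiff_t stride;
};

// Models the runtime contiguity check on a box.
bool isContiguous(const Box &b) { return b.extent <= 1 || b.stride == 1; }

// Models a generic out-of-line element-wise assignment: dst = src.
void assign(const Box &dst, const Box &src) {
  for (std::size_t i = 0; i < src.extent; ++i) {
    const auto k = static_cast<std::ptrdiff_t>(i);
    dst.base[k * dst.stride] = src.base[k * src.stride];
  }
}

// A callee that requires contiguous storage for its argument.
void scaleContiguous(double *data, std::size_t n, double factor) {
  for (std::size_t i = 0; i < n; ++i)
    data[i] *= factor;
}

// Caller side: pass the data directly when it is contiguous; otherwise
// copy-in to a contiguous temporary, call, and copy-out afterwards.
void callWithCopyInOut(const Box &actual, double factor) {
  if (isContiguous(actual)) {
    scaleContiguous(actual.base, actual.extent, factor);
    return;
  }
  std::vector<double> temp(actual.extent);
  Box tempBox{temp.data(), actual.extent, 1};
  assign(tempBox, actual); // copy-in through the out-of-line routine
  scaleContiguous(temp.data(), temp.size(), factor);
  assign(actual, tempBox); // copy-out back to the strided storage
}
```

In this model the copy loops live in one out-of-line function, so the
call site contains only the contiguity check and two calls. That is the
property the change relies on to reduce the amount of code the backend
has to optimize.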

I added an internal -inline-copyinout-for-boxes option that reverts to the
old behavior, just to make it easier to triage any performance regressions
that appear after the change.

Without the change, CPU2017/521.wrf takes 2179 seconds to compile, and
that is with module_dm.f90 compiled at -O0 (without -O0, this single
module takes 5775 seconds to compile). With the change, the total
compilation time of the benchmark drops to 722 seconds.

Differential Revision: https://reviews.llvm.org/D140446