[flang-commits] [flang] [Flang] - Add optional inlining of allocatable assignments with hlfir.expr RHS (PR #186880)

Wed Mar 18 07:53:23 PDT 2026

bhandarkar-pranav wrote:

> > The above is a reduced testcase from one of our internal benchmarks.
> 
> Thank you for the example. I am trying to understand where this enormous speedup is coming from. Can you please confirm that in both cases you have a temporary array created for the `cos(a)` elemental operation? If that is the case, does it mean that the library implementation of `Assign` is much slower than the inlined code, and could the library be compiled "better" to reduce the gap?

I am sorry, I do not fully understand this. The `hlfir.elemental` should not have a temporary array associated with it (because it produces an `hlfir.expr`) until after lowering to FIR, right? I profiled the passes, and `OpenMPOpt` is the pass that blows up in time during LTO whenever `__FortranAAssign` is called (and its body is pulled in from the runtime library). This is my output when I add `-time-passes` to clang-linker-wrapper:

```
===-------------------------------------------------------------------------===
                          Pass execution timing report
===-------------------------------------------------------------------------===
  Total Execution Time: 21.7081 seconds (21.7052 wall clock)

   ---User Time---   --System Time--   --User+System--   ---Wall Time---  --- Name ---
   6.5107 ( 30.8%)   0.0688 ( 11.5%)   6.5795 ( 30.3%)   6.5799 ( 30.3%)  OpenMPOptPass
   2.1780 ( 10.3%)   0.0010 (  0.2%)   2.1790 ( 10.0%)   2.1791 ( 10.0%)  OpenMPOptCGSCCPass
   1.7045 (  8.1%)   0.2698 ( 45.2%)   1.9743 (  9.1%)   1.9744 (  9.1%)  AMDGPU DAG->DAG Pattern Instruction Selection

```

> Overall, I am okay with doing the inlining, though I would think the library implementation may be faster in some cases (e.g. it may use just a single `memcpy` for a contiguous array of any rank vs. potentially multiple `memcpy` calls that LLVM would generate for the N-level loop nest).

I agree with this concern, which is part of the reason why I have guarded this with a flag.

https://github.com/llvm/llvm-project/pull/186880