[flang-commits] [flang] internal proc trampolines (PR #66156)

Slava Zakharin via flang-commits flang-commits at lists.llvm.org
Tue Sep 12 16:30:20 PDT 2023


https://github.com/vzakhari created https://github.com/llvm/llvm-project/pull/66156:

- [mlir][tensor] Check the EmptyOp's dynamicSize to be non-negative (#65577)
- [X86] Remove _REV instructions from the EVEX2VEX tables (#65752)
- [OpenMPOpt] Allow indirect calls in AAKernelInfoCallSite (#65836)
- [libomptarget][NFC]Rename targetDataMapper to targetDat in interface.cpp (#65915)
- [VP] more functional Intrinsic to definitions
- [clang][deps] NFCI: Use `FileEntryRef` in `ModuleDepCollectorPP`
- [clang] NFCI: Use `FileEntryRef` in 'clang-tools-extra'
- [AMDGPU] SILowerI1Copies: clear kill flags on COPY (#65883)
- [llvm-objdump] --adjust-vma adjust symbol table
- [AMDGPU] SILowerControlFlow: fix preservation of LiveIntervals
- [X86][EVEX512] Restrict attaching EVEX512 for default CPU only, NFCI (#65920)
- [clang][Diagnostics] Add source range to uninitialized diagnostics (#65896)
- [RISCV] Disable zcmp push/pop for variadic functions. (#65302)
- [include-cleaner] Always keep non-self-contained files (#65499)
- [AMDGPU][NFCI] Refactor BUFInstructions.td (#65746)
- [Clang][RISCV] Use Decl for checkRVVTypeSupport (#65778)
- [lldb][AArch64] Add type marker to ReadAll/WriteALLRegisterValues data
- [mlir][complex] Support fastmath in the binary op conversion. (#65702)
- [clangd] Delete deprecated enumerateTweaks endpoint
- [clang-format][NFC] Minor cleanup of token annotator and test cases
- [Driver] Remove duplicate -e on DragonFly (#65867)
- [clang][AArch64] Add --print-supported-extensions support (#65466)
- [ELF][test] Make tests less sensitive to addresses/number of sections
- [NFC] test commit
- [clang][AArch64] Fix supported extensions test case
- [clang][dataflow][NFC] Delete unused function. (#65602)
- [clangd] Rollforward include-cleaner library usage in symbol collector.
- [clangd] allow extracting to variable for lambda expressions
- [AArch64] Remove copy instruction between uaddlv and urshr
- [clangd] Forward --target to system include extraction (#65824)
- [mlir][Vector] Make `vector.contract` work with scalable vectors (#65724)
- [PHIElimination] Handle subranges in LiveInterval updates
- [SPIR-V] Add SPIR-V logical triple.
- [clangd] Fix buildbot breakages from stemming from 64366d4935d3c56ce5906a321edb2e91d4f886bc
- [mlir][llvm] Return failure from type converter for n-D scalable vectors (#65450)
- [SPIRV-V] Add SPIR-V logical triple to llc
- [X86] lea-2.ll - add test showing failure to fold shl(zext(or(x,c1)),c2) 'addlike' into LEA instruction
- [X86] promoteExtBeforeAdd - add support for or/xor 'addlike' patterns
- [Flang][OpenMP] Minor changes in reduction to work with HLFIR (#65775)
- [libc] Add missing add_lvalue_reference_t (#65940)
- [libc][bazel] Add CPP tests (#65941)
- [clang][Interp] Implement __builtin_offsetof
- [OpenMP][OMPT] Fix device identifier collision during callbacks (#65595)
- [clang][Interp] Check floating results for NaNs
- [mlir][vector] Extend mask calculation for vector.contract (#65733)
- [NFC][Analysis] Run update_analyze_test_checks.py on Analysis/CostModel/AArch64/sve-ldst.ll
- [Flang][OpenMP][Sema] Support propagation of REQUIRES information across program units
- [NFC][RemoveDIs] Prefer iterator-insertion over instructions
- [flang] Improve length information in character transformational (#65771)
- Revert rGa8cef6b58e2d41f04ed4fa63c3f628eac1a28925 "[X86] promoteExtBeforeAdd - add support for or/xor 'addlike' patterns"
- [lldb] Don't tab complete stop-hook delete beyond 1st argument
- updated buildkite pipeline generation (#65574)
- [mlir][vector][NFC] `isDisjointTransferIndices`: Use `getConstantIntValue` (#65931)
- [libc++][ranges][NFC] Status page: Adds `enumerate_view` patch
- Add missing vrnd intrinsics
- Revert "[Flang][OpenMP][Sema] Support propagation of REQUIRES information across program units"
- [analyzer] CStringChecker should check the first byte of the destination of strcpy, strncpy
- [analyzer] CStringChecker buffer access checks should check the first bytes
- [MLIR] Make SM_90 integration tests use `TargetAttr` (#65926)
- [Convergence] allow non-convergent ops before entry and loop intrinsics (#65939)
- [MLIR][PDL] Add Bytecode support for negated native constraints
- [mlir][Interfaces][NFC] DestinationStyleOpInterface: Improve documentation (#65927)
- Fixup "[analyzer] CStringChecker buffer access checks should check the first bytes"
- [VP] IR expansion for abs/smax/smin/umax/umin
- [VP] IR expansion for zext/sext/trunc/fptosi/fptosi/sitofp/uitofp/fptrunc/fpext
- [OpenMP][DeviceRTL][AMDGPU] Add missing libomptarget build targets (#65964)
- Fix warning in MSVC
- [NFC][Clang][RISCV] Fix typos of riscv-v-spec doc in riscv_vector.td (#65944)
- [DebugInfo] Parse StrOffsets section if needed
- [Flang][OpenMP][Sema] Support propagation of REQUIRES information across program units
- [NFC][RemoveDIs] Prefer iterators over inst-pointers in InstCombine
- [libc][NFC] Fix missing header in CMakelists.txt (#65960)
- [libc] Add type_traits tests (#65956)
- Remove extra switch from  0323938d
- [X86] matchIndexRecursively - add  zext(add/addlike(x,c)) -> index: zext(x), disp + zext(c) pattern handling
- [Driver] Do not generate error about unsupported target specific options when there is no compiler jobs
- [mlir][GPU] Handle LLVM pointer attributes on memref arguments.
- [RISCV] Add extract_subvector tests for a statically-known VLEN. NFC (#65389)
- [clang][VarDecl] Reset un-evaluated constant for all C++ modes (#65818)
- [RISCV] Refactor extract_subvector lowering slightly. NFC  (#65391)
- [lldb] Correctly invalidate unloaded image tokens (#65945)
- [lld][MachO] Add option to suppress mismatch profile errors (#65551)
- [mlir][spirv] Support `spirv.coopmatrix` type (de-)serialization (#65831)
- [GlobPattern][docs] Fix poorly rendered docs
- [RISCV] Shrink vslidedown when lowering fixed extract_subvector (#65598)
- [InlineAsm] refactor InlineAsm class NFC (#65649)
- Fold or-phi test
- [libcxx] Fix include directory order (#65859)
- [lldb][Tests] Reformat API tests with black
- [mlir][VectorOps] Don't drop scalable dims when lowering transfer_reads/writes (in VectorToLLVM)
- [NFC][RemoveDIs] Provide an iterator-taking split-block method
- ValueTracking: Add baseline tests for fcmp with non-0/inf constants
- [Driver][test] Don't check the version in the triple
- [include-cleaner] Fix handling of enums in presence of qualifiers (#65952)
- [lldb] Improve completion tests (#65973)
- [AMDGPU] Autogenerate min.ll/max.ll tests. NFC. (#65786)
- [NFC][AsmPrinter] Refactor DbgVariable as a std::variant
- [NFC][AsmPrinter] Remove dead multi-MMI handling from DwarfFile::addScopeVariable
- [NFC][AsmPrinter] Expose std::variant-ness of DbgVariable
- [NFC][AsmPrinter] Use std::visit in constructVariableDIEImpl
- [ADT] Remove any_isa (NFC) (#65636)
- Revert "[lldb] Improve completion tests (#65973)"
- [Fuchsia] Re-enable libcxx timezone database (#65981)
- Avoid running optimization passes in frontend test
- ProfDataUtils: Add extractFromBranchWeightMD function; NFC
- LoopRotate: Add code to update branch weights
- [MLIR][Linalg] Improve documentation in `LinalgInterfaces.td` (NFC)
- [AMDGPU] Global ISel for packed fp32 instructions (#65803)
- [RISCV] Lower fixed vectors extract_vector_elt through stack at high LMUL
- [ELF] Respect orders of symbol assignments and DEFINED (#65866)
- Re-apply 75c487602a "[ORC] Add a MachOBuilder utility, use it to..." with fixes.
- [DebugInfo] Use getStableDebugLoc to pick IRBuilder DebugLocs
- [test][sanitizer] Check LINKER_IS_LLD  to detect LLD
- [libc] Manually set the AMDGPU code object version (#65986)
- [llvm-readelf] Add --extra-sym-info (#65580)
- Revert "[PHIElimination] Handle subranges in LiveInterval updates"
- [libc++] Add regression tests for issue #46841
- [libc++abi] Overhaul test_exception_storage.pass.cpp
- [SLP][NFC]Use ArrayReffor operands directly instead of entry/operand number, NFC.
- [libc++] Use the default initializer for char_type in std::num_get::do_get
- [lld-macho][nfc]Add bounds on sections and subsections check before attempting to dereferencing iterators.
- [libc++] Mark static variables of locale::id as constinit (#65783)
- [libc][NFC] Eliminate the internal header library target. (#65837)
- [AsmPrinter] Fix an unused variable warning
- [InlineAsm] fix msvc warning
- [NFC][RemoveDIs] Use iterators over inst-pointers when using IRBuilder
- [LLDB][NFC] Add the mojo language to Language::GetPrimaryLanguage
- [RISCV] Add a combine to form masked.load from unit strided load (#65674)
- [test][tsan] Disable flaky test on PPC
- Lazy initialize diagnostic when handling MLIR properties (#65868)
- [sanitizer] Change return type of __sanitizer_symbolize_demangle to bool (#65991)
- [mlir][gpu] Deprecate gpu::Serialization* passes. (#65857)
- [mlir][openacc] Model acc cache directive as data entry operands on acc.loop (#65521)
- [flang][openacc] Lower acc cache directive (#65673)
- [mlir][Linalg] Move `linalg.fill` -> `linalg.pack` pattern into `fill` canonicalization patterns. (#66002)
- [ELF] Reset two member variables in Ctx::reset
- [flang][openacc] Enable lowering support for OpenACC atomic operations (#65776)
- [lldb-vscode] Make descriptive summaries and raw child for synthetics configurable (#65687)
- [CUDA][HIP] Do not mark extern shared var (#65990)
- Revert "[Driver] Properly report error for unsupported powerpc darwin/macos triples"
- [mlir] Make it possible to build a DenseResourceElementsAttr from untyped memory. (#66009)
- [ORC] Fix implicit conversion warning due to 5293109774d.
- LoopUnrollRuntime: Add weights to all branches
- [test] Add x86-registered-target to amdgpu_throw_trap.cpp
- [cfi-verify tests]: Skip two x86-only tests if x86 is not enabled
- [flang] Call finalization on empty type (#66010)
- [test][hwasan] Disable test failing on x86_64 with no -lstdc++
- [test][hwasan] Relax test condition
- [test] Change llc -march= to -mtriple=
- [test] Change llvm-mc -arch= to -triple=
- [test] Change llc -march= to -mtriple= & llvm-mc -arch= to -triple=
- [test] debug-info-correlate.ll requires an ELF target triple
- Fix a few messed up links in the ReleaseNotes
- [BOLT][NFC] Use formatv in DataAggregator/DataReader prints
- [BOLT][NFC] Speedup YAML profile processing
- [mlir][sparse] Fix bug in new syntax parser (#66024)
- adds `__reference_constructs_from_temporary`
- [libc++][ranges] Fix a `split_view` test accidentally using `lazy_split`
- [Clang][Docs] Fix typo in clang-offload-packager documentation
- [test][hwsasan] Invert enable_aliases check
- Revert "adds `__reference_constructs_from_temporary`"
- Add "process metadata" Mach-O LC_NOTE for corefiles
- [test][hwasan] Disable the test as it fails on Arm as well
- [test][hwasan] Fix UNSUPPORTED condition
- [Parse] Split incremental-extensions (#65683)
- [NewGVN] Decrement UseCount only if SSA copy has one use
- [libc++] Fix broken test in C++03 mode
- [sanitizer][msan] VarArgHelper for loongarch64
- [Driver] Properly report error for unsupported powerpc darwin/macos triples
- [VP] IR expansion for maxnum/minnum
- [Fuchsia] Support building runtimes for RISC-V on Linux (#66025)
- [clang][Sema] Fix format size estimator's handling of %o, %x, %X with alternative form
- [InlineAsm] Add constraint A to getMemConstraintName (#65292)
- [libomptarget][NFC] update comments.
- [libcxx] <experimental/simd> Removed original implementations and tests
- [libcxx] <experimental/simd> Add ABI tags, class template simd/simd_mask implementations. Add related simd traits and tests.
- [libcxx] <experimental/simd> Added simd width functions, simd_size traits and related tests
- [libcxx] <experimental/simd> Added aliagned flag types, traits is_simd_flag_type[_v], memory_alignment[_v] and related tests
- [libcxx] <experimental/simd> Added internal storage type, constructors, subscript operators of class simd/simd_mask and related tests
- [libcxx] <experimental/simd> Add broadcast constructor of class simd/simd_mask
- [libomptarget] Rename AMDGPUSignalTy member Signal to HSASignal.
- [gn build] Port 0e30dd44adc9
- [gn build] Port a284d0cc9c69
- [gn build] Port ce5652c78ac0
- [gn build] Port e7a45c6d768b
- [ELF][test] Make tests less sensitive of addresses/number of sections
- [mlir][arith] Rename operations: `maxf` → `maximumf`, `minf` → `minimumf` (#65800)
- Update the developer policy information on "Stay Informed" to refer to GitHub teams (#65798)
- [hwasan] Re-enable the test with fallback
- Include the issue description in the subscription comment so that email notification is self-contained (#65839)
- RegisterCoalescer: Correctly set valid lanes when keeping live out implicit defs
- RegisterCoalescer: Don't delete IMPLICIT_DEF if it's live into the same block
- [mlir] Support null interface to base conversion (#65988)
- [clang][dataflow] Merge `RecordValue`s with different locations correctly. (#65319)
- [AMDGPU] Regen combine-fma-add-mul-pre-legalize.mir
- [clang][dataflow] Eliminate `RecordValue::getChild()`. (#65586)
- [flang] Lower BIND(C) assumed length to CFI descriptor (#65950)
- Reland "[lldb] Improve completion tests (#65973)"
- [lldb] Format more Python files with black (#65979)
- [flang][hlfir] Add hlfir.maxval intrinsic (#65705)
- [AMDGPU] Make default AMDHSA Code Object Version to be 5 (#65410)
- Update some uses of `getAttr()` to be explicit about Inherent vs Discardable (NFC)
- [libc] Add is_object (#65749)
- [AMDGPU][GFX11] Add more test coverage for FMA instructions. (#65935)
- [AMDGPU] Handle inUndef flag in LiveVariables::recomputeForSingleDefVirtReg
- [AArch64][GISel] Expand test coverage of FPow.
- [AArch64][GlobalISel] Add lowering for constant BIT/BIF/BSP (#65897)
- Fix some AffineOps to properly declare their inherent affinemap Attribute (#66050)
- [clang][dataflow] Remove RecordValue.getLog() usage in HTMLLogger (#65645)
- [AMDGPU] Remove the GFX11 runs in CodeGen/AMDGPU/fma.f16.ll.
- [mlir][vector] Refine vector.transfer_read hoisting/forwarding (#65770)
- Revert "[AMDGPU] Make default AMDHSA Code Object Version to be 5 (#65410)" (#66060)
- [SPIR-V] Support SPV_INTEL_arbitrary_precision_integers_extension, misc utils for other extensions
- [GitHub] use checkout action v4 (#65819)
- [ci] escape artifacts paths
- [NVPTX] Remove NOP definition
- [SVE] Precommit test to show missing initialisation of call operand.
- [DAGCombiner][RISCV] Prefer to sext i32 non-negative values (#65984)
- [BOLT] Prevent adding secondary entry points for BB labels
- Add some -early-live-intervals RUN lines (#66058)
- [GuardWidening] Fix widening possibility check (#66064)
- [libc++][test] Add '-Wdeprecated-copy', '-Wdeprecated-copy-dtor' warnings to the test suite
- [GIsel][AArch64] Legalize <2 x i16> for G_INSERT_VECTOR_ELT (#65830)
- Revert "[AArch64][GlobalISel] Add lowering for constant BIT/BIF/BSP"
- [mlir][bufferization][NFC] Rename copy_tensor op to materialize_in_destination (#65467)
- Fix out of line Concept-comparisons of NestedNameSpecifiers (#65993)
- [AArch64]: Refactor target parser to use Bitset. (#65423)
- [NVPTX] Tighten up legal v2i16 ops a bit
- [OpenMP][test][VE] Limit the number of AFFINITY_MAX_CPUS for VE (#65872)
- [RISCV] Move getSmallestVTForIndex so it can be used by lowerINSERT_VECTOR_ELT. NFC
- [libc++] Document experimental features in the library (#65994)
- Don't rely in llvm::Bitset CTAD. NFC.
- [mlir][spirv] Improve coop matrix attribute handling (#66020)
- [AsmPrinter][DwarfDebug] Skip vars with fragments in different location kinds
- [SPIRV] Get pointer size from datalayout (#66096)
- Propagate the DWARF version from the main compiler invocation to PCHC… (#66032)
- [AMDGPU] Fix some MIR tests (#66090)
- [libc++] Add missing std::ranges::join_view to the list of experimental features
- [libc] Make add_header and add_gen_header targets normal library targets. (#66045)
- [AMDGPU] Add utilities to track number of user SGPRs. NFC.
- Add host-supports-nvptx requirement to lit tests (#66102)
- [AArch64][SME]Update intrinsic interface for ldr/str (#65593)
- [llvm][unittests] Remove unneeded header includes
- [AArch64][SME]Update intrinsic interface for read/write (#65594)
- [flang][hlfir][openacc] Updated LIT tests checks. (#66099)
- Update GoogleTest to v1.14.0 (#65823)
- [MC][RISCV] Add assembly syntax highlighting for RISCV (#65853)
- [libc++][test][NFC] Rewrite map count test and add test case for "final" compare
- [RISCV] Rework gather/scatter DAG combine structure [NFC]
- [flang][openacc] Check atomic update lhs/rhs are scalar (#66113)
- [libunwind] Use __builtin_alloca to avoid missing include
- [scudo] Allow using a different test main.
- [compiler-rt] Add missing include in unittest
- [Clang] Fix crash in Parser::ParseDirectDeclarator by adding check that token is not an annotation token
- [include-mapping] Python fixes
- Statically analyze likely and unlikely blocks based on metadata
- Splits cleanup block lowered by AsyncToAsyncRuntime. (#66123)
- Revert "Update GoogleTest to v1.14.0 (#65823)"
- [libc] Add missing deps for header libraries. (#66125)
- [libc++][hardening] Add back the safe mode.
- [AMDGPU] Fix an unused variable warning
- [Windows] Avoid using FileIndex for unique IDs
- Add host-supports-nvptx requirement to lit tests (#66129)
- [runtimes] Add llvm-size to RUNTIMES_TEST_DEPENDS
- [libc] Fix a typo in a CMakeLists.txt - replace DEPS with DEPENDS. (#66130)
- [Clang][OpenMP] Clang adding the addrSpace according to DataLayout fix (#65483)
- [libc][NFC] Factor GPU exiting into a common function (#66093)
- workflows/pr-subscriber: Handle libc++ and libc++abi labels (#66029)
- [mlir][gpu][sparse] gracefully accept zero size allocation (#66127)
- [clang][CodeGen] Emit annotations for function declarations.
- [sanitizer] Remove SYMBOLIZER_DEPS from symbolizer
- github-automation: Use a single comment for team mentions on pull requests (#66037)
- AMDGPU: Correctly lower llvm.sqrt.f32
- clang/OpenCL: Add inline implementations of sqrt in builtin header
- HIP: Directly use f32 sqrt intrinsic
- AMDGPU: Teach valueIsKnownNeverF32Denorm about frexp
- [llvm-exegesis] Add retry count to subprocess tests
- [test][dxil-dis] Update metadata to match target triple
- [AMDGPU] Port AMDGPURewriteUndefForPHI to new pass manager (#66008)
- Work around two more instances of __noinline__ conflicts. (#66138)
- JumpThreading: Propagate branch weights in tryToUnfoldSelectInCurrBB (#66116)
- [NFC] Sort `debuginfo` paths prior to updating the list (#66140)
- [flang] Cray pointer in module (#66119)
- fixup! [Clang][OpenMP] Clang adding the addrSpace according to DataLayout fix (#65483)
- [NFC] [Support] Fix warning when build with clang-cl on Windows. (#65387)
- [BOLT] Fix AutoFDO output format after D154120
- Fix NATVIS visualization of ActionResult
- [AMDGPU] Fix a warning
- [mlir][spirv] Fix remaining coop matrix verification corner cases (#66137)
- [OpenMP] Remove optimization skipping reduction struct initialization (#65697)
- [libc++][NFC] Remove stray #if 1 that was probably a debugging leftover
- [libc] Improve the implementation of the rand() function (#66131)
- [llvm] Adopt WithMarkup in the ARM backend (#65561)
- [libc++] Fix the rotate direction used in countl_zero()
- [github] GitHub Actions workflows changes (#65856)
- [lldb][NFCI] BreakpointResolverName ctor shouldn't unnecessarily copy data (#66001)
- [GlobalISel] GISelKnownBits: forward unused depth parameter
- Revert "[Clang][OpenMP] Clang adding the addrSpace according to DataLayout fix (#65483)"
- [libc++] Simplify the implementation of locale::id (#65781)
- [RFC][flang] Trampolines for internal procedures.


>From ee64f7c37b381ed171b19b02bddf5635da559fd8 Mon Sep 17 00:00:00 2001
From: Slava Zakharin <szakharin at nvidia.com>
Date: Tue, 12 Sep 2023 16:22:43 -0700
Subject: [PATCH] [RFC][flang] Trampolines for internal procedures.

I would like to start a discussion about ways of modifying
the current trampoline approach for Fortran internal procedures
used as actual arguments or pointer targets.

As Peter Klausler noted before, the current approach implies security
risks due to the requirement for a writeable and executable stack. We may need
to agree on a new scheme that does not have this issue.
---
 flang/docs/InternalProcedureTrampolines.md | 373 +++++++++++++++++++++
 flang/docs/ProcedurePointer.md             |   2 +-
 2 files changed, 374 insertions(+), 1 deletion(-)
 create mode 100644 flang/docs/InternalProcedureTrampolines.md

diff --git a/flang/docs/InternalProcedureTrampolines.md b/flang/docs/InternalProcedureTrampolines.md
new file mode 100644
index 000000000000000..e91c87f7062d67e
--- /dev/null
+++ b/flang/docs/InternalProcedureTrampolines.md
@@ -0,0 +1,373 @@
+<!--===- docs/InternalProcedureTrampolines.md
+
+   Part of the LLVM Project, under the Apache License v2.0 with LLVM Exceptions.
+   See https://llvm.org/LICENSE.txt for license information.
+   SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception
+
+-->
+
+# Trampolines for pointers to internal procedures.
+
+## Overview
+
+```fortran
+subroutine host()
+  integer :: local = 10
+  call internal()
+  return
+
+  contains
+  subroutine internal()
+    print *, local
+  end subroutine internal
+end subroutine host
+```
+
+The code generated for subprogram `internal()` must have access to the scope of
+its host procedure, e.g. to access `local` variable. Flang achieves this by passing
+an extra argument to `internal()` that is a tuple of references to all variables
+used via host association inside `internal()`. We will call this extra argument
+a static chain link.
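+As an illustration only, the static chain link can be modelled in C++ as a
+struct of pointers that the host builds in its own frame and passes as an
+explicit extra argument. The names `HostLink`, `internal()` and `host()` below
+are hypothetical; Flang builds the actual tuple during lowering, not in C++.
+
+```cpp
+#include <cassert>
+
+// Hypothetical model of the static chain link for the example above:
+// one pointer per host-associated variable used by `internal()`.
+struct HostLink {
+  int *local; // refers to the host's `local`
+};
+
+// Models `internal()`: host variables are reached only through the link.
+int internal(const HostLink *link) {
+  assert(link != nullptr);
+  return *link->local;
+}
+
+// Models `host()`: the link lives in the host's frame and is passed explicitly.
+int host() {
+  int local = 10;
+  HostLink link{&local};
+  return internal(&link);
+}
+```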
+
+The Fortran 2008 standard allows internal procedures to be used as actual arguments or
+procedure pointer targets:
+
+> Fortran 2008 contains several extensions to Fortran 2003; some of these are listed below.
+>
+> * An internal procedure can be used as an actual argument or procedure pointer target.
+>
+> NOTE 12.18
+>
+> An internal procedure cannot be invoked using a procedure pointer from either Fortran or C after the host instance completes execution, because the pointer is then undefined. While the host instance is active, however, the internal procedure may be invoked from outside of the host procedure scoping unit if that internal procedure was passed as an actual argument or is the target of a procedure pointer.
+
+Special handling is required for the internal procedures that might be invoked
+via an argument association or via pointer.
+This document describes how Flang implements it.
+
+## Current Flang implementation
+
+### Examples
+
+Internal procedure as procedure pointer target:
+
+```fortran
+module other
+  abstract interface
+     function callback()
+       integer :: callback
+     end function callback
+  end interface
+  contains
+  subroutine foo(fptr)
+    procedure(callback), pointer :: fptr
+    ! `fptr` is pointing to `callee`, which needs the static chain link.
+    print *, fptr()
+  end subroutine foo
+end module other
+
+subroutine host(local)
+  use other
+  integer :: local
+  procedure(callback), pointer :: fptr
+  fptr => callee
+  call foo(fptr)
+  return
+
+  contains
+
+  function callee()
+    integer :: callee
+    callee = local
+  end function callee
+end subroutine host
+
+program main
+  call host(10)
+end program main
+```
+
+Internal procedure as actual argument (F90 style):
+
+```fortran
+module other
+  contains
+  subroutine foo(fptr)
+    interface
+      integer function fptr()
+      end function
+    end interface
+    ! `fptr` is pointing to `callee`, which needs the static chain link.
+    print *, fptr()
+  end subroutine foo
+end module other
+
+subroutine host(local)
+  use other
+  integer :: local
+  call foo(callee)
+  return
+
+  contains
+
+  function callee()
+    integer :: callee
+    callee = local
+  end function callee
+end subroutine host
+
+program main
+  call host(10)
+end program main
+```
+
+Internal procedure as actual argument (F77 style):
+
+```fortran
+module other
+  contains
+  subroutine foo(fptr)
+    integer :: fptr
+    ! `fptr` is pointing to `callee`, which needs the static chain link.
+    print *, fptr()
+  end subroutine foo
+end module other
+
+subroutine host(local)
+  use other
+  integer :: local
+  call foo(callee)
+  return
+
+  contains
+
+  function callee()
+    integer :: callee
+    callee = local
+  end function callee
+end subroutine host
+
+program main
+  call host(10)
+end program main
+```
+
+In all cases, the call sequence implementing the `fptr()` call site inside `foo()`
+must pass the static chain link to the actual function `callee()`.
+
+### Usage of trampolines in Flang
+
+`BoxedProcedure` pass recognizes `fir.emboxproc` operations that
+embox a subroutine address together with the static chain link,
+and transforms them into a sequence of operations that replace
+the result of `fir.emboxproc` with an address of a trampoline.
+Eventually, it is the address of the trampoline that is passed
+as an actual argument to `foo()`.
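+Conceptually, before the `BoxedProcedure` pass runs, a boxed procedure is a
+pair of a code address and a static chain link. The hypothetical C++ sketch
+below shows why a trampoline is needed. The pair is larger than the plain
+function pointer that `foo()` expects, so the chain half has to be hidden
+somewhere else. `BoxedProc` and `callBoxed` are illustrative names only.
+
+```cpp
+#include <cassert>
+
+// Hypothetical model of what `fir.emboxproc` pairs together.
+struct BoxedProc {
+  int (*code)(void *chain); // internal procedure, chain passed explicitly
+  void *chain;              // static chain link created by the host
+};
+
+// What `foo()` receives for `integer function fptr()`: one code address.
+using PlainProc = int (*)();
+
+// Two words do not fit into one: a trampoline hides the `chain` half
+// behind a callable address of type PlainProc.
+static_assert(sizeof(BoxedProc) > sizeof(PlainProc),
+              "a boxed procedure cannot be passed as a plain pointer");
+
+// Calling through the box needs both halves.
+int callBoxed(const BoxedProc &box) { return box.code(box.chain); }
+```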
+
+The trampoline has the following structure:
+
+```assembly
+callee_trampoline:
+  MOV <static-chain-address>, R#
+  JMP <callee-address>
+```
+
+Where:
+- `<callee-address>` is the address of function `callee()`.
+- `<static-chain-address>` - the address of the static chain
+  object created inside `host()`.
+- `R#` is a target specific register.
+
+In the MLIR LLVM dialect, the replacement looks like this:
+
+```
+    llvm.call @llvm.init.trampoline(%8, %9, %7) : (!llvm.ptr<i8>, !llvm.ptr<i8>, !llvm.ptr<i8>) -> ()
+    %10 = llvm.call @llvm.adjust.trampoline(%8) : (!llvm.ptr<i8>) -> !llvm.ptr<i8>
+    %11 = llvm.bitcast %10 : !llvm.ptr<i8> to !llvm.ptr<func<void ()>>
+    llvm.call @_QMotherPfoo(%11) {fastmathFlags = #llvm.fastmath<fast>} : (!llvm.ptr<func<void ()>>) -> ()
+
+```
+
+So any call of `fptr` inside `foo()` invokes the trampoline, which sets up
+the `R#` register and jumps directly to `callee()`.
+
+The ABI of `callee()` is adjusted using the `llvm.nest` argument attribute,
+so that the target code generator assumes the static chain argument is passed
+to `callee()` in `R#`:
+
+```
+  llvm.func @_QFhostPcallee(%arg0: !llvm.ptr<struct<(ptr<i32>)>> {fir.host_assoc, llvm.nest}) -> i32 attributes {fir.internal_proc} {
+```
+
+#### Trampoline handling
+
+The currently used [llvm.init.trampoline intrinsic](https://llvm.org/docs/LangRef.html#trampoline-intrinsics)
+expects the memory for the trampoline content to be passed to it as the first argument.
+The memory has to be writeable at the point of the intrinsic call, and it has to be executable
+at any point where `callee()` might be invoked via the trampoline.
+
+The `@llvm.init.trampoline` intrinsic initializes the trampoline area in a target-specific manner
+so that, when executed, the trampoline sets a target-specific register to the third argument
+(the static chain address) and then transfers control to the function given by the second argument.
+
+Some targets may perform additional actions to guarantee that the trampoline is ready for execution,
+e.g. [call](https://github.com/llvm/llvm-project/blob/main/compiler-rt/lib/builtins/trampoline_setup.c)
+`__clear_cache`.
+
+For each internal procedure, a trampoline may be initialized once per host invocation.
+
+The target-specific address of the new trampoline function must be obtained via another intrinsic call:
+
+```
+%p = call i8* @llvm.adjust.trampoline(i8* %trampoline_address)
+```
+
+Note that the value of `%p` is equal to `%trampoline_address` in most cases, but this is not
+a requirement; this is partly [why](https://lists.llvm.org/pipermail/llvm-dev/2011-August/042845.html)
+the second intrinsic was introduced:
+
+> By the way an example of adjust_trampoline is ARM, which or's a 1 into the address of the trampoline.  When the pointer is called the processor sees the 1 and puts itself into thumb mode.
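+Here is a minimal sketch of the ARM case from the quote, assuming a Thumb-capable
+target. Setting bit 0 of a branch target makes an interworking branch (`BX`/`BLX`)
+switch to Thumb state, so the callable address differs from the buffer address.
+The function name `adjustForThumb` is hypothetical and is not LLVM's lowering.
+
+```cpp
+#include <cassert>
+#include <cstdint>
+
+// Hypothetical model of what llvm.adjust.trampoline may do on ARM/Thumb:
+// return the trampoline buffer address with the Thumb bit set.
+std::uintptr_t adjustForThumb(std::uintptr_t trampolineAddress) {
+  // The buffer must be at least 2-byte aligned so that bit 0 is free.
+  assert((trampolineAddress & 1u) == 0);
+  return trampolineAddress | 1u;
+}
+```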
+
+Currently, the trampolines are allocated on the stack of `host()` subroutine,
+so that they are available throughout the life span of `host()` and are
+automatically deallocated at the end of `host()` invocation.
+Unfortunately, this requires the program stack to be writeable and executable
+at the same time, which might be a security concern.
+
+> NOTE: LLVM's AArch64 backend supports the `nest` attribute, but it does not seem to support the trampoline intrinsics.
+
+## Alternative implementation(s)
+
+To address the security risk we may consider managing the trampoline memory
+in a way that it is not writeable and executable at the same time.
+One of the options is to use separate allocations for the trampoline code
+and the trampoline "data".
+
+The trampolines may be located in non-writeable executable memory:
+```assembly
+trampoline0:
+  MOV (TDATA[0].static_chain_address), R#
+  JMP (TDATA[0].callee_address)
+trampoline1:
+  MOV (TDATA[1].static_chain_address), R#
+  JMP (TDATA[1].callee_address)
+...
+```
+
+The `TDATA` memory is writeable and contains *<static chain address, function address>*
+for each of the trampolines.
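+The split layout can be modelled portably in C++ as a hypothetical sketch.
+`TDATA` is an ordinary writeable array, while each `trampoline<I>` is a compiled
+(hence non-writeable, executable) function that performs the two loads from
+`TDATA[I]`. Portable C++ cannot load a value into the `nest` register `R#`, so
+this sketch passes the static chain as an explicit argument instead. `Callee`,
+`TData` and `kTrampolines` are illustrative names.
+
+```cpp
+#include <array>
+#include <cassert>
+#include <cstddef>
+#include <utility>
+
+using Callee = int (*)(void *staticChain);
+
+// One writeable TDATA entry per trampoline.
+struct TData {
+  void *staticChainAddress = nullptr;
+  Callee calleeAddress = nullptr;
+};
+
+constexpr std::size_t kNumTrampolines = 4;
+std::array<TData, kNumTrampolines> TDATA;
+
+// Plays the role of `trampolineI:` above: load both fields of TDATA[I],
+// then transfer control to the callee with the static chain.
+template <std::size_t I> int trampoline() {
+  assert(TDATA[I].calleeAddress != nullptr && "TDATA entry not initialized");
+  return TDATA[I].calleeAddress(TDATA[I].staticChainAddress);
+}
+
+template <std::size_t... Is>
+constexpr std::array<int (*)(), sizeof...(Is)>
+makeTable(std::index_sequence<Is...>) {
+  return {&trampoline<Is>...};
+}
+
+// Read-only table of callable trampoline addresses.
+constexpr auto kTrampolines =
+    makeTable(std::make_index_sequence<kNumTrampolines>{});
+```
+
+Because the number of `trampoline<I>` instances is fixed at compile time, this
+model corresponds to a hard limit on the number of live trampolines.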
+
+A runtime support library may provide APIs for initializing/accessing/deallocating
+the trampolines that can be used by `BoxedProcedure` pass.
+
+### Implementation considerations
+
+* The static chain address still has to be passed in a fixed target-specific register,
+  and implementations that rely on LLVM back-ends can use the `nest` attribute for this.
+
+* The trampoline area must be able to grow, because there can be a trampoline
+  for each internal procedure per host invocation, and an internal procedure can call
+  the host recursively. This means that the number of live trampolines in one thread
+  can grow quickly.
+
+  ```fortran
+  recursive subroutine host(local)
+    use other
+    integer :: local
+    call foo(callee)
+    return
+
+    contains
+
+    function callee()
+      integer :: callee
+      callee = local
+      if (local .le. CONST_N) then
+         call host(local + 1)
+      endif
+    end function callee
+  end subroutine host
+  ```
+
+* On the other hand, a hard limit on the number of simultaneously live trampolines
+  would allow placing the trampolines in the static code segment.
+
+* Each thread may have its own dynamic trampoline area to reduce the number
+  of required locks.
+
+* Some support is required for the offload devices.
+
+* With this approach, each trampoline invocation performs two indirect memory accesses
+  (loading the static chain address and the callee address from TDATA).
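+The growth and per-thread points can be sketched for the data side only. In the
+hypothetical class below, each thread owns a pool of TDATA slots that grows on
+demand and recycles freed slots. Entries are kept in a `std::deque` so that their
+addresses stay stable while the pool grows, which matters if generated trampolines
+refer to them. Growing the executable trampoline area as well would need
+target-specific code emission and is not modelled.
+
+```cpp
+#include <cassert>
+#include <cstddef>
+#include <deque>
+#include <vector>
+
+struct TData {
+  const void *staticChainAddress;
+  const void *calleeAddress;
+};
+
+// Hypothetical per-thread pool of TDATA slots.
+class ThreadTrampolinePool {
+public:
+  // Reserves a slot: reuse the most recently freed one, otherwise grow.
+  std::size_t reserve(const void *callee, const void *chain) {
+    std::size_t slot;
+    if (!freeSlots_.empty()) {
+      slot = freeSlots_.back();
+      freeSlots_.pop_back();
+    } else {
+      slot = entries_.size();
+      entries_.push_back(TData{nullptr, nullptr}); // deque: no relocation
+    }
+    entries_[slot] = TData{chain, callee};
+    ++live_;
+    return slot;
+  }
+
+  // Called at every exit from the host that reserved the slot.
+  void release(std::size_t slot) {
+    assert(slot < entries_.size() && live_ > 0);
+    freeSlots_.push_back(slot);
+    --live_;
+  }
+
+  std::size_t live() const { return live_; }
+  std::size_t capacity() const { return entries_.size(); }
+
+private:
+  std::deque<TData> entries_;
+  std::vector<std::size_t> freeSlots_;
+  std::size_t live_ = 0;
+};
+
+// One pool per thread, so reserve/release need no locking.
+thread_local ThreadTrampolinePool tlsPool;
+```
+
+In the recursive `host()` example, each active invocation holds one slot, so
+`live()` tracks the recursion depth.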
+
+### Option #1: Fortran runtime support
+
+The following APIs are suggested:
+
+```c++
+/**
+ * \brief Initializes a new trampoline and returns its internal handle.
+ *
+ * Initializes a new trampoline with the given \p callee_address
+ * and \p static_chain_address, and returns the new trampoline's
+ * internal handle. The compiler calls this method once per host
+ * invocation for each internal procedure whose address needs to be
+ * passed around.
+ *
+ * The initialization reserves a new entry in TDATA, initializes
+ * the entry with the given \p callee_address and
+ * \p static_chain_address, and reserves a new entry in the
+ * trampoline area that uses the corresponding TDATA entry.
+ *
+ * Optional:
+ *   \p scratch may be used to switch between the trampoline pool
+ *   and the llvm.init.trampoline implementation, e.g. if the compiler
+ *   passes a non-null \p scratch, it will be used as writeable/executable
+ *   memory for the new trampoline.
+ */
+const void *InitTrampoline([[maybe_unused]] void *scratch,
+                           const void *callee_address,
+                           const void *static_chain_address);
+
+/**
+ * \brief Returns the trampoline's address for the given handle.
+ *
+ * \p handle is a value returned by InitTrampoline().
+ * The result of AdjustTrampoline() is the actual callable
+ * trampoline's address.
+ *
+ * Optional: may be implemented via llvm.adjust.trampoline.
+ */
+const void *AdjustTrampoline(const void *handle);
+
+/**
+ * \brief Frees internal resources associated with the given trampoline.
+ *
+ * The compiler must call this API at every exit from the host function.
+ *
+ * Optional: may be a no-op if LLVM trampolines are used underneath.
+ */
+void FreeTrampoline(const void *handle);
+```
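+To make the intended life cycle concrete, here is a hypothetical, portable mock
+of these three entry points; it is not Flang runtime code. It makes four
+assumptions:
+
+* It uses a fixed pool of compiled stubs instead of a generated trampoline area.
+* It passes the static chain as an ordinary first argument instead of through the
+  `nest` register.
+* It takes the `FreeTrampoline` handle as `const void *` to match the type
+  `InitTrampoline` returns.
+* It relies on casts between function and object pointers. C++ supports these
+  only conditionally, although they work on common POSIX targets.
+
+```cpp
+#include <array>
+#include <cassert>
+#include <cstddef>
+#include <mutex>
+#include <utility>
+
+namespace {
+// In this mock, callees receive the static chain as their first argument.
+using Callee = int (*)(const void *staticChain);
+
+struct Entry {
+  const void *staticChain = nullptr;
+  Callee callee = nullptr;
+  bool inUse = false;
+};
+
+constexpr std::size_t kPoolSize = 8;
+std::array<Entry, kPoolSize> tdata; // writeable TDATA
+std::mutex poolMutex;
+
+// Stand-ins for the trampoline area: stub<I> forwards through tdata[I].
+template <std::size_t I> int stub() {
+  return tdata[I].callee(tdata[I].staticChain);
+}
+
+template <std::size_t... Is>
+constexpr std::array<int (*)(), sizeof...(Is)>
+makeStubs(std::index_sequence<Is...>) {
+  return {&stub<Is>...};
+}
+
+constexpr auto stubs = makeStubs(std::make_index_sequence<kPoolSize>{});
+} // namespace
+
+const void *InitTrampoline([[maybe_unused]] void *scratch,
+                           const void *callee_address,
+                           const void *static_chain_address) {
+  std::lock_guard<std::mutex> lock(poolMutex);
+  for (Entry &entry : tdata) {
+    if (!entry.inUse) {
+      entry = Entry{static_chain_address,
+                    reinterpret_cast<Callee>(const_cast<void *>(callee_address)),
+                    true};
+      return &entry; // the handle is the TDATA entry itself
+    }
+  }
+  return nullptr; // pool exhausted
+}
+
+const void *AdjustTrampoline(const void *handle) {
+  const auto *entry = static_cast<const Entry *>(handle);
+  const auto index = static_cast<std::size_t>(entry - tdata.data());
+  assert(index < kPoolSize);
+  return reinterpret_cast<const void *>(stubs[index]);
+}
+
+void FreeTrampoline(const void *handle) {
+  std::lock_guard<std::mutex> lock(poolMutex);
+  *const_cast<Entry *>(static_cast<const Entry *>(handle)) = Entry{};
+}
+```
+
+In this model, a host calls `InitTrampoline`, passes the result of
+`AdjustTrampoline` to `foo()`, and calls `FreeTrampoline` on every exit path.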
+
+`InitTrampoline` will do the initial allocation of the TDATA memory
+and the trampoline area followed by the initialization of the trampoline
+area with the binary code to "link" the trampolines with the corresponding
+TDATA entries. After the initial allocation the trampoline area is made
+executable and not writeable.
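+The write-then-make-executable step can be sketched with POSIX `mmap` and
+`mprotect`, assuming a POSIX host. This is a hypothetical helper, not a proposed
+API. The bytes copied in are placeholders rather than real trampoline code, and
+nothing is executed.
+
+```cpp
+#include <cassert>
+#include <cstddef>
+#include <cstring>
+#include <sys/mman.h>
+#include <unistd.h>
+
+// Hypothetical helper: returns a read/execute copy of `code`, or nullptr.
+void *allocateTrampolineArea(const unsigned char *code, std::size_t size) {
+  assert(size > 0);
+  const std::size_t page = static_cast<std::size_t>(sysconf(_SC_PAGESIZE));
+  const std::size_t length = ((size + page - 1) / page) * page;
+  // 1. Map the area read/write (not executable) and fill it.
+  void *area = mmap(nullptr, length, PROT_READ | PROT_WRITE,
+                    MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
+  if (area == MAP_FAILED)
+    return nullptr;
+  std::memcpy(area, code, size);
+  // 2. Flip it to read/execute: from now on it is never writeable.
+  if (mprotect(area, length, PROT_READ | PROT_EXEC) != 0) {
+    munmap(area, length);
+    return nullptr;
+  }
+  // Some targets would also need an instruction-cache flush at this point.
+  return area;
+}
+```
+
+Since the area is never writeable and executable at the same time, this follows
+the W^X policy that the stack-based scheme violates.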
+
+If there is an available entry in the TDATA/trampoline area, the function
+initializes the TDATA entry with the given arguments and returns
+a handle to the trampoline entry.
+
+`FreeTrampoline` will free the reserved entry.
+
+### Option #2: LLVM/compiler-rt support
+
+It may be beneficial for projects besides Flang to use the alternative trampoline
+implementation, so would it be reasonable to put this support
+into LLVM/compiler-rt instead?
+
+### Implementation questions
+
+* The trampoline area initialization implies writing target-specific binary code
+  for the trampolines. Are there utils that the runtime implementation
+  can reuse?
diff --git a/flang/docs/ProcedurePointer.md b/flang/docs/ProcedurePointer.md
index b41c4003518ec05..da4848ff197bdcd 100644
--- a/flang/docs/ProcedurePointer.md
+++ b/flang/docs/ProcedurePointer.md
@@ -280,7 +280,7 @@ due to C721 and C723.
 Initially the current plan is to implement pointers to internal procedures
 using the LLVM Trampoline intrinsics. This has the drawback of requiring the
 stack to be executable, which is a security hole. To avoid this, we will need
-improve the implementation to use heap-resident thunks.
+to [improve the implementation](InternalProcedureTrampolines.md) to use heap-resident thunks.
 
 ### Procedure pointer assignment `p => proc`
 


