[flang-commits] [flang] Parallel runtime library design doc (PRIF) (PR #76088)
Dan Bonachea via flang-commits
flang-commits at lists.llvm.org
Thu Oct 17 13:22:16 PDT 2024
bonachea wrote:
Hi @klausler - I’m part of the [PRIF team at Berkeley Lab](https://fortran.lbl.gov/). Thanks for the great questions!
@klausler said:
> Target hardware for coarray Fortran includes two important subsets: those targets whose interconnects admit direct load/store access to remote data, and those whose data transfers are driven by controlling an RDMA NIC's MMRs. [...] it would be useful to have a runtime library interface to perform the necessary address calculation to compute a remote base address for a given coarray on a particular image, [...] allow an optimizer to amortize the cost of that calculation when multiple references to the same coarray/image will follow
The actual hardware landscape is more complicated than you seem to imply. In particular, many modern HPC platforms exhibit a combination of both characteristics you describe: cache-coherent load/store access between cores/processes running within a single physical memory domain (i.e. "intra-node") *AND* RDMA access across an explicit interconnect between physical memory domains (i.e. "inter-node"). In general we care about deployments where both classes of transport may be *simultaneously* active in a given job execution, and the transport distinction is not (in general) globally static, but instead depends on the physical placement of the images involved in a given communication operation. A purely "single-node" system, where all images happen to use an intra-node transport, is just a special case of this more general situation.
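To make that concrete, here is a small coarray sketch (the image numbering and array size are purely illustrative) in which the *same* coindexed read is served by load/store when image `j` shares a memory domain with the reading image, and by RDMA when it does not:

```fortran
program placement_example
  implicit none
  integer :: a(1000)[*]        ! one copy of a on every image
  integer :: b(1000), j

  a = this_image()             ! define the local copy
  sync all                     ! ensure every image has defined its copy of a

  ! Pick some other image (wrapping around at the last image).
  j = merge(this_image() + 1, 1, this_image() < num_images())

  ! The same coindexed read may travel over either transport:
  ! if image j shares a physical memory domain with this image, the
  ! runtime can satisfy it via cache-coherent load/store; if image j
  ! is on another node, the runtime issues an RDMA get instead.
  b(:) = a(:)[j]
end program placement_example
```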
The overheads involved in initiating communication over an RDMA transport are generally on the order of microseconds (corresponding to thousands of cycles on a modern processor), and usually dwarf details like serial instructions for address arithmetic by several orders of magnitude. Amortizing constant-time "setup" overheads for RDMA communication is unlikely to be a fruitful optimization, and would require exposing non-portable interconnect-specific details to the PRIF client.
However, a load/store transport through hardware-managed shared memory tends to be orders of magnitude faster (in overhead and latency), and in this case overheads like address translation are expected to have a greater relative impact. Here the cost of redundant address calculations and even the cost of extra procedure calls may become significant relative to the cost of a load/store-based data transfer. We therefore agree that such transports offer opportunities for fruitful amortization of communication "setup" overheads. We envision eventually expanding PRIF with calls allowing the client to detect and take advantage of this situation when appropriate. This is explicitly documented in Section 7: Future Work.
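As a purely illustrative sketch of the kind of amortization we have in mind (the `prif_get_direct_pointer` entry point below is **not** part of the current PRIF specification, it is the sort of extension contemplated in Section 7; the `prif` module and `prif_coarray_handle` names are assumed from the spec): when the runtime reports that image `j` is reachable by load/store, the client compiler could query the remote base address once and then issue plain loads inside the loop:

```fortran
! Accumulate the contents of image j's copy of a coarray, assuming it
! is directly addressable from this image.
subroutine accumulate_remote(handle, j, n, total)
  use, intrinsic :: iso_c_binding, only: c_ptr, c_associated, c_f_pointer
  use prif, only: prif_coarray_handle       ! assumed names from the spec
  implicit none
  type(prif_coarray_handle), intent(in) :: handle
  integer, intent(in) :: j, n
  real, intent(inout) :: total
  type(c_ptr) :: base
  real, pointer :: remote(:)
  integer :: i

  ! HYPOTHETICAL call: query (once) a direct pointer to image j's copy,
  ! valid only when a load/store transport connects the two images.
  call prif_get_direct_pointer(handle, j, base)

  if (c_associated(base)) then
    call c_f_pointer(base, remote, [n])
    do i = 1, n
      total = total + remote(i)   ! plain loads, no per-element runtime calls
    end do
  end if
end subroutine accumulate_remote
```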
@klausler said:
> when the target interconnect is known at compilation time to be one that supports asynchronous transactions, it would similarly be useful to have runtime library interfaces to initiate asynchronous reads and await their completions, again for hiding load latency.
We agree that on multi-node networks, asynchronous communication is an effective means to hide communication latency by overlapping it with computation or other communication. [Our group](https://go.lbl.gov/class) has a long history of exploiting those types of communication optimizations in the context of other parallel programming models. PRIF currently lacks entry points for explicitly asynchronous communication operations, because we wanted to start with the simplest interface that would allow a complete and compliant implementation of Fortran’s multi-image parallel semantics. We would like to see future revisions of PRIF add extensions for explicitly asynchronous communication (especially for coindexed reads, as you suggest), as documented in Section 7: Future Work.
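As a sketch of the latency-hiding pattern such an extension would enable (`prif_get_nb`, `prif_wait`, and `do_independent_work` below are illustrative placeholders, not procedures in the current PRIF specification):

```fortran
! Read a buffer from image j while overlapping the transfer with
! independent local work.
subroutine overlapped_read(handle, j, n, buf)
  use prif, only: prif_coarray_handle       ! assumed name from the spec
  implicit none
  type(prif_coarray_handle), intent(in) :: handle
  integer, intent(in) :: j, n
  real, intent(out) :: buf(n)
  integer :: req

  ! HYPOTHETICAL split-phase interface: initiate the coindexed read
  ! without blocking, returning a request handle.
  call prif_get_nb(handle, j, buf, req)

  ! Overlap: run work that does not touch buf while the transfer is
  ! in flight (most valuable on a high-latency RDMA transport).
  call do_independent_work()

  ! HYPOTHETICAL completion call: block until buf is ready to use.
  call prif_wait(req)
end subroutine overlapped_read
```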
https://github.com/llvm/llvm-project/pull/76088