[Mlir-commits] [mlir] [acc] OpenACC dialect design philosophy and details (PR #75548)

Thu Dec 14 16:21:12 PST 2023

llvmbot wrote:



@llvm/pr-subscribers-openacc

@llvm/pr-subscribers-mlir-openacc

Author: Razvan Lupusoru (razvanlupusoru)

<details>
<summary>Changes</summary>

This document captures the design philosophy of the acc dialect. It also shares the rationale behind the design and implementation of various operations - and ties that back to the dialect design goals.

Co-authored-by: Valentin Clement <clementval@gmail.com>
Co-authored-by: Slava Zakharin <szakharin@nvidia.com>

---

Patch is 23.52 KiB, truncated to 20.00 KiB below, full version: https://github.com/llvm/llvm-project/pull/75548.diff


2 Files Affected:

- (added) mlir/docs/Dialects/OpenACC.md (+451) 
- (modified) mlir/include/mlir/Dialect/OpenACC/OpenACCBase.td (+3-1) 


``````````diff

diff --git a/mlir/docs/Dialects/OpenACC.md b/mlir/docs/Dialects/OpenACC.md
new file mode 100755
index 00000000000000..20a121562d51f1
--- /dev/null
+++ b/mlir/docs/Dialects/OpenACC.md
@@ -0,0 +1,451 @@
+The `acc` dialect is an MLIR dialect for representing the OpenACC
+programming model. OpenACC is a standardized directive-based model which
+is used with C, C++, and Fortran to enable programmers to expose
+parallelism in their code. The descriptive approach used by OpenACC
+allows targeting of parallel multicore and accelerator targets like GPUs
+by giving the compiler the freedom of how to parallelize for specific
+architectures. OpenACC also provides the ability to optimize the
+parallelism through increasingly more prescriptive clauses.
+
+This dialect models the constructs from the
+[OpenACC 3.3 specification](https://www.openacc.org/sites/default/files/inline-images/Specification/OpenACC-3.3-final.pdf).
+
+This document describes the design of the OpenACC dialect in MLIR. It
+lists and explains design goals and design choices along with their
+rationale. It also describes specifics with regard to `acc` dialect
+operations, types, and attributes.
+
+[TOC]
+
+## Dialect Design Goals
+
+* Must have a complete representation of the OpenACC language.
+	- A frontend requires this in order to properly generate a
+	representation of possible `acc` pragmas in MLIR. Additionally,
+	this dialect is expected to be further lowered when materializing
+	its semantics. Without a complete representation, a frontend might
+	choose a lower abstraction (such as direct runtime calls) - but
+	this would impact the ability to do analysis and optimizations on
+	the dialect.
+* Allow representation at the same semantic level as the OpenACC
+language while having capability to represent nuances of the source
+language semantics (such as Fortran descriptors) in an agnostic manner.
+	- Using abstractions that closely model the OpenACC language
+	simplifies frontend implementation. It also allows for easier
+	debugging of the IR. However, sometimes source language specific
+	behavior is needed when materializing OpenACC. In these cases, such
+	as privatization of C++ objects with default constructor, the
+	frontend fills in the `recipe` along with the `private` operation
+	which can be packaged neatly with the `acc` dialect operations.
+* Be able to regenerate the semantic equivalent of the user pragmas from
+the dialect (including bounds, names, clauses, modifiers, etc).
+	- This is a strong measure of making sure that the dialect is not
+	lossy in semantics. It also allows capability to generate
+	appropriate and useful debug information outside of the frontend.
+* Be dialect agnostic so that it can be used and coexist with other
+dialects including but not limited to `hlfir`, `fir`, `llvm`, `cir`.
+	- Directive-based models such as OpenACC are always used with a
+	source language, so the `acc` dialect coexisting with other
+	dialect(s) is necessary by construction. Through proper
+	abstractions, neither the `acc` dialect nor the source language
+	dialect should have dependencies on each other; where needed,
+	interfaces should be used to ensure `acc` dialect can verify
+	expected properties.
+* The dialect must allow dataflow to be modeled accurately and
+performantly using MLIR's existing facilities.
+	- Appropriate dataflow modeling is important for analyses and IR
+	reasoning - even something as simple as walking the uses. Therefore
+	operations, like data operations, are expected to generate results
+	which can be used in modeling behavior. For example, consider an
+	`acc copyin` clause. After the `acc.copyin` operation, a pointer
+	which lives on the device should be distinguishable from one that
+	lives in host memory.
+* Be friendly to MLIR optimization passes by implementing common
+interfaces.
+	- Interfaces, such as `MemoryEffects`, are the key way MLIR
+	transformations and analyses are designed to interact with the IR.
+	In order for the operations in the `acc` dialect to be optimizable
+	(either directly or even indirectly by not blocking optimizations
+	of nested IR), implementing relevant common interfaces is needed.
+
+The design philosophy of the `acc` dialect is to adhere to these design
+goals: all current and planned operations, attributes, and types must
+satisfy them.
+
+## Operation Categories
+
+The OpenACC dialect includes three categories of operations: high-level
+operations (which retain the same semantic meaning as their OpenACC
+language equivalent), intermediate-level operations (which are used to
+decompose clauses from constructs), and low-level operations (which
+encode specifics associated with the source language in a generic way).
+
+The high-level operations list contains the following OpenACC language
+constructs and their corresponding operations:
+* `acc parallel` → `acc.parallel`
+* `acc kernels` → `acc.kernels`
+* `acc serial` → `acc.serial`
+* `acc data` → `acc.data`
+* `acc loop` → `acc.loop`
+* `acc enter data` → `acc.enter_data`
+* `acc exit data` → `acc.exit_data`
+* `acc host_data` → `acc.host_data`
+* `acc init` → `acc.init`
+* `acc shutdown` → `acc.shutdown`
+* `acc update` → `acc.update`
+* `acc set` → `acc.set`
+* `acc wait` → `acc.wait`
+* `acc atomic read` → `acc.atomic.read`
+* `acc atomic write` → `acc.atomic.write`
+* `acc atomic update` → `acc.atomic.update`
+* `acc atomic capture` → `acc.atomic.capture`
+
+This second group contains operations which are used to represent
+either decomposed constructs or clauses for more accurate modeling:
+* `acc routine` → `acc.routine` + `acc.routine_info` attribute
+* `acc declare` → `acc.declare_enter` + `acc.declare_exit` or
+`acc.declare`
+* `acc {construct} copyin` → `acc.copyin` (before region) +
+`acc.delete` (after region)
+* `acc {construct} copy` → `acc.copyin` (before region) +
+`acc.copyout` (after region)
+* `acc {construct} copyout` → `acc.create` (before region) +
+`acc.copyout` (after region)
+* `acc {construct} attach` → `acc.attach` (before region) +
+`acc.detach` (after region)
+* `acc {construct} create` → `acc.create` (before region) +
+`acc.delete` (after region)
+* `acc {construct} present` → `acc.present` (before region) +
+`acc.delete` (after region)
+* `acc {construct} no_create` → `acc.nocreate` (before region) +
+`acc.delete` (after region)
+* `acc {construct} deviceptr` → `acc.deviceptr`
+* `acc {construct} private` → `acc.private`
+* `acc {construct} firstprivate` → `acc.firstprivate`
+* `acc {construct} reduction` → `acc.reduction`
+* `acc cache` → `acc.cache`
+* `acc update device` → `acc.update_device`
+* `acc update host` → `acc.update_host`
+* `acc host_data use_device` → `acc.use_device`
+* `acc declare device_resident` → `acc.declare_device_resident`
+* `acc declare link` → `acc.declare_link`
+* `acc exit data delete` → `acc.delete` (with `structured` flag as
+false)
+* `acc exit data detach` → `acc.detach` (with `structured` flag as
+false)
+* `acc {construct} {data_clause}(var[lb:ub])` → `acc.bounds`
+
+The low-level operations are:
+* `acc.private.recipe`
+* `acc.reduction.recipe`
+* `acc.firstprivate.recipe`
+* `acc.global_ctor`
+* `acc.global_dtor`
+* `acc.yield`
+* `acc.terminator`
+
+Their semantics and rationale are explained in the sections below.
+
+### Data Operations
+
+#### Data Clause Decomposition
+The data clauses are decomposed from their constructs for better
+dataflow modeling in MLIR. There are multiple reasons for this which
+are consistent with the dialect goals:
+* Correctly represents dataflow. Data clauses have different effects
+at entry to region and at exit from region.
+* Easier to attach interfaces such as `MemoryEffects` to a single
+operation. This can better reflect semantics (like the fact that an
+`acc.copyin` operation only reads host memory).
+* Operations can be moved or optimized individually (e.g. by `CSE`).
+* Easier to keep track of debug information. Line location can point to
+the text representing the data clause instead of the construct.
+Additionally, attributes can be used to keep track of variable names in
+clauses without having to walk the IR tree in an attempt to recover the
+information (this makes the `acc` dialect more agnostic with regard to
+the other dialect it is used with).
+* Clear operation ordering since all data operations are on the same
+list.
+
+Each of the `acc` dialect data operations represents either the
+entry or the exit portion of the data action specification. Thus,
+`acc.copyin` represents the semantics defined in section
+`2.7.7 copyin clause` whose wording starts with
+`At entry to a region`. The decomposed exit operation `acc.delete`
+represents the second part of that section, whose wording starts with
+`At exit from the region`. The `delete` action may be performed
+after checking and updating of the relevant reference counters noted.
+
+The `acc` data operations, even when decomposed, retain their original
+data clause in the `dataClause` attribute so that this information can
+be recovered during debugging. For example, `acc copy` does not
+translate to an `acc.copy` operation, but instead to `acc.copyin` for
+entry and `acc.copyout` for exit. Both of the decomposed operations
+hold a `dataClause` field that specifies this was an `acc copy`.
+
+The link between the decomposed entry and exit operations is the SSA
+value produced by the entry operation. Namely, it is the `accPtr` result
+which is used both in the `dataOperands` of the operation used for the
+construct and in the `accPtr` operand of the exit operation.
+
+#### Bounds
+
+OpenACC data clauses allow the use of bounds specifiers as per
+`2.7.1 Data Specification in Data Clauses`. However, array dimensions
+for the data are not always required in the clause if the source
+language's type system captures this information - the user can just
+specify the variable name in the data clause. So the `acc.bounds`
+operation is an important piece to ensure uniform representation of both
+explicit user set dimensions and implicit type-based dimensions. It
+contains several key features to allow properly encoding sizes in a
+manner flexible and agnostic to the source language's dialect:
+* Multi-dimensional arrays can be represented by using multiple ordered
+`acc.bounds` operations.
+* Bounds are required to be zero-normalized. This works well with the
+`PointerLikeType` requirement in data clauses - since a lowerbound of 0
+means looking at data at the zero offset from pointer. This requirement
+also works well in ensuring the `acc` dialect is agnostic to source
+language dialect since it prevents ambiguity such as the case of Fortran
+arrays where the lower bound is not a fixed value.
+* If the source dialect does not encode the dimensions in the type (e.g.
+`!fir.array<?x?xi32>`) but instead encodes them in some other way (such
+as through descriptors), then the frontend must fill in the `acc.bounds`
+operands with appropriate information (such as loads from descriptor).
+The `acc.bounds` operation also permits a lossy source dialect, such
+as when the frontend uses aggressive pointer decay and cannot represent
+the dimensions in the type system (e.g. using `!llvm.ptr` for arrays).
+Both of these aspects show the flexibility of the `acc.bounds`
+operation: the representation stays agnostic since the `acc` dialect is
+not expected to understand how to extract dimension information from
+the types of the source dialect.
+* The OpenACC specification allows either extent or upperbound in the
+data clause depending on whether it is Fortran or C and C++. The
+`acc.bounds` operation is rich enough to accept either or both - for
+convenience in lowering to the dialect and for ability to precisely
+capture the meaning from the clause.
+* The stride, either in units or bytes, can also be captured in the
+`acc.bounds` operation. This is also important for accepting a source
+language's arrays without forcing the frontend to normalize them in
+some way. For example, consider a case where in a parent function, a
+whole array is mapped to the device. Then only a view with a non-1
+stride is passed to a child function (e.g. a Fortran array slice with
+non-1 stride). A `copy` operation of this data in the child should be
+able to avoid remapping this array. If instead the operation required
+normalizing the array (such as making it contiguous), then the same
+host data could end up with multiple disjoint mappings on the device,
+which is error-prone.
+
+#### Counters
+
+The data operations also maintain semantics described in the OpenACC
+specification related to runtime counters. More specifically, consider
+the specification of the entry portion of `acc copyin` in section 2.7.7:
+```
+At entry to a region, the structured reference counter is used. On an
+enter data directive, the dynamic reference counter is used.
+- If var is present and is not a null pointer, a present increment
+action with the appropriate reference counter is performed.
+- If var is not present, a copyin action with the appropriate reference
+counter is performed.
+- If var is a pointer reference, an attach action is performed.
+```
+The `acc.copyin` operation includes these semantics, including those
+related to attach, which is specified through the `varPtrPtr` operand.
+The `structured` flag on the operation is important since the
+`structured reference counter` should be used when the flag is true; and
+the `dynamic reference counter` should be used when it is false.
+
+At exit from structured regions (`acc data`, `acc kernels`), the
+`acc copyin` clause is decomposed to `acc.delete` (with the
+`structured` flag as true). The semantics of `acc.delete` are
+also consistent with the OpenACC specification noted for the exit
+portion of the `acc copyin` clause:
+```
+At exit from the region:
+- If the structured reference counter for var is zero, no action is
+taken.
+- Otherwise, a detach action is performed if var is a pointer reference,
+and a present decrement action with the structured reference counter is
+performed if var is not a null pointer. If both structured and dynamic
+reference counters are zero, a delete action is performed.
+```
+
+### Types
+
+There are a few acc dialect type categories to describe:
+* type of acc data clause operation input `varPtr`
+	- The type of `varPtr` must be pointer-like. This is done by
+	attaching the `PointerLikeType` interface to the appropriate MLIR
+	type. Although the memory/storage concept is a lower level
+	abstraction, it is useful because the OpenACC model distinguishes
+	between host and device memory explicitly - and the mapping between
+	the two is done through pointers. Thus, by explicitly requiring it
+	in the dialect, the appropriate language frontend must create
+	storage or use a type that satisfies the mapping constraint.
+* type of result of acc data clause operations
+	- The result type of an acc data clause operation is exactly the
+	type of `varPtr`. This was done intentionally instead of introducing
+	an `acc.ref/ptr` type so that IR compatibility and the dialect's
+	existing strong type checking can be maintained. This is needed
+	since the `acc` dialect must live within another dialect whose type
+	system is unknown to it. The only constraint is that the appropriate
+	dialect type must use the `PointerLikeType` interface.
+* type of decomposed clauses
+	- Decomposed clauses, such as `acc.bounds` and `acc.declare_enter`,
+	produce types to allow their results to be used only in specific
+	operations.
+
+### Recipes
+
+Recipes are a generic way to express source language specific semantics.
+
+There are currently two categories of recipes, but the recipe concept
+can be extended for any additional low-level information that needs
+to be captured for successful lowering of OpenACC. The two categories
+are:
+* recipes used in the context of privatization associated with a
+construct
+* recipes used in the context of additional specification of data
+semantics
+
+The intention of the recipes is to specify how materialization of an
+action, such as privatization, should be done when the semantics
+of the action need to be interpreted and lowered, such as before
+generating LLVM dialect.
+
+The recipes used for privatization provide a source-language independent
+way of specifying the creation of a local variable of that type. This
+means using the appropriate `alloca` instruction and being able to
+specify default initialization or default constructor.
+
+### Routine
+
+The routine directive is used to note that a procedure should be made
+available for the accelerator in a way that is consistent with its
+modifiers, such as those that describe the parallelism. In the acc
+dialect, an acc routine is represented through two joint pieces - an
+attribute and an operation:
+* The `acc.routine` operation is simply a specifier which notes which
+symbol (or string) the acc routine is needed for, along with the
+associated parallelism. It defines a symbol that the attribute can
+reference.
+* The `acc.routine_info` attribute is used on the source dialect
+specific operation and specifies one or multiple `acc.routine` symbols.
+Typically, this is attached to `func.func` which either provides the
+declaration (in case of externals) or provides the actual body of the
+acc routine in the dialect that the source language was translated to.
+
+### Declare
+
+OpenACC `declare` is a mechanism which declares a definition of a global
+or a local to be accessible to the accelerator with an implicit lifetime
+matching that of the scope in which it was declared. Thus, `declare`
+semantics are represented through multiple operations and attributes:
+* `acc.declare` - This is a structured operation which contains an
+MLIR region and can be used in a similar manner as `acc.data` to specify
+an implicit data region with specific procedure lifetime. This is
+typically used inside `func.func` after variable declarations.
+* `acc.declare_enter` - This is an unstructured operation which is
+used as a decomposed form of `acc declare`. It effectively allows the
+entry operation to exist in a scope different than the exit operation.
+It can also be used along with `acc.declare_exit`, which consumes its
+token, to define a scoped region without using an MLIR region. This
+operation is also used in `acc.global_ctor`.
+* `acc.declare_exit` - The matching equivalent of `acc.declare_enter`
+except that it specifies exit semantics. This operation is typically
+used inside a `func.func` at the exit points or with `acc.global_dtor`.
+* `acc.global_ctor` - Lives at the same level as source dialect globals
+and is used to specify data actions to be done at program entry. This
+is used in conjunction with source dialect globals whose lifetime is
+not just a single procedure.
+* `acc.global_dtor` - Defines the exit data actions that should be done
+at program exit. Typically used to revert the actions of
+`acc.global_ctor`.
+
+The attributes:
+* `acc.declare` - This is a facility for easier determination of
+variables which are `acc declare`'d. This attribute is used on
+operations producing globals and on operations producing locals such as
+dialect specific `alloca`s. An operation must have this attribute for
+its result to appear in a data mapping operation associated with any of
+the `acc.declare*` operations.
+* `acc.declare_action` - Since the OpenACC specification allows
+declaration of variables that have yet to be allocated, this attribute
+is used at the allocation and deallocation points. More specifically,
+this attribute captures symbols of functions to be called to perform
+an action either pre-allocate, post-allocate, pre-deallocate, or
+post-deallocate. Calls to these functions should be materialized when
+lowering OpenACC semantics to ensure proper data actions are done
+after the allocation/deallocation.
+
+## OpenACC Transforms and Analyses
+
+The design goal for the `acc` dialect is to be friendly to MLIR
+optimization passes including CSE and LICM. Additionally, since it is
+designed to recover original clauses, it makes late verification and
+analysis possible in the MLIR framework outside of the frontend.
+
+This section describes a few MLIR-level passes for which the `acc`...
[truncated]

``````````
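
As a reading aid for the "Counters" section of the patch above, the following is a minimal Python sketch of how a decomposed `acc.copyin` (entry) and `acc.delete` (exit) pair might interact with the structured and dynamic reference counters. It is a toy model, not the dialect or any OpenACC runtime: the class `DeviceDataEnvironment`, its methods, and the `log` list are hypothetical names introduced only for illustration, and null pointers, attach/detach, and actual data movement are deliberately left out.

```python
class DeviceDataEnvironment:
    """Toy present-table model (hypothetical; not a real OpenACC runtime)."""

    def __init__(self):
        self.structured = {}  # structured reference counter per variable
        self.dynamic = {}     # dynamic reference counter per variable
        self.log = []         # data actions that would reach the device

    def _counter(self, structured):
        # The `structured` flag on the operation selects the counter.
        return self.structured if structured else self.dynamic

    def is_present(self, var):
        total = self.structured.get(var, 0) + self.dynamic.get(var, 0)
        return total > 0

    def copyin(self, var, structured):
        """Entry half: copyin action if absent, else present increment."""
        if not self.is_present(var):
            self.log.append(f"copyin {var}")
        counter = self._counter(structured)
        counter[var] = counter.get(var, 0) + 1

    def delete(self, var, structured):
        """Exit half: present decrement, then delete if both counters are 0."""
        counter = self._counter(structured)
        if counter.get(var, 0) == 0:
            return  # counter is zero: no action is taken
        counter[var] -= 1
        if not self.is_present(var):
            self.log.append(f"delete {var}")


env = DeviceDataEnvironment()
env.copyin("a", structured=True)    # e.g. entry to an `acc data copyin(a)`
env.copyin("a", structured=False)   # e.g. `acc enter data copyin(a)`
env.delete("a", structured=True)    # exit from the structured region
still_present = env.is_present("a")
env.delete("a", structured=False)   # e.g. `acc exit data delete(a)`
```

In this sketch the device copy survives the exit from the structured region because the dynamic counter is still non-zero, which is the behavior the quoted specification text describes.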

</details>
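
The "Bounds" section of the patch requires zero-normalized bounds and notes that a clause may carry either an upper bound (Fortran) or an extent (C and C++). The arithmetic a frontend would need can be sketched in Python as follows. This is an illustration under stated assumptions: the `Bounds` class and both helper functions are hypothetical, the upper bound is assumed to be inclusive, and stride is ignored.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Bounds:
    """Hypothetical stand-in for the values given to one `acc.bounds` op."""
    lowerbound: int  # zero-based offset from the start of the data
    upperbound: int  # zero-based, assumed inclusive
    extent: int      # number of elements


def from_fortran_section(section_lb, section_ub, declared_lb):
    """Normalize a Fortran section `a(lb:ub)` of an array whose declared
    lower bound is `declared_lb` (which need not be 1)."""
    lb = section_lb - declared_lb
    ub = section_ub - declared_lb
    return Bounds(lb, ub, ub - lb + 1)


def from_c_subarray(start, length):
    """Normalize a C/C++ subarray `a[start:length]` (already zero-based)."""
    return Bounds(start, start + length - 1, length)


# Fortran: `real :: a(10:20)` with clause `copyin(a(12:15))`
fortran_bounds = from_fortran_section(12, 15, declared_lb=10)
# C: `float a[11]` with clause `copyin(a[2:4])`
c_bounds = from_c_subarray(2, 4)
```

Both clauses describe the same four elements at offset 2 from the base pointer, so after normalization they produce identical values even though the source languages spell them differently.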


https://github.com/llvm/llvm-project/pull/75548