[llvm] 4fc2c92 - [AArch64][SME] Document SME ABI implementation in LLVM
Sander de Smalen via llvm-commits
llvm-commits at lists.llvm.org
Fri Sep 16 07:48:52 PDT 2022
Author: Sander de Smalen
Date: 2022-09-16T14:48:37Z
New Revision: 4fc2c922fe12a6a4aa5de6cf24a2cd2cd5a4584a
URL: https://github.com/llvm/llvm-project/commit/4fc2c922fe12a6a4aa5de6cf24a2cd2cd5a4584a
DIFF: https://github.com/llvm/llvm-project/commit/4fc2c922fe12a6a4aa5de6cf24a2cd2cd5a4584a.diff
LOG: [AArch64][SME] Document SME ABI implementation in LLVM
Adds a design document for implementing the SME ABI in LLVM. This document
can be used as a reference for follow-up patches that attempt to implement
the ABI.
Reviewed By: efriedma
Differential Revision: https://reviews.llvm.org/D131562
Added:
llvm/docs/AArch64SME.rst
Modified:
llvm/docs/UserGuides.rst
Removed:
################################################################################
diff --git a/llvm/docs/AArch64SME.rst b/llvm/docs/AArch64SME.rst
new file mode 100644
index 0000000000000..4585bb96664f9
--- /dev/null
+++ b/llvm/docs/AArch64SME.rst
@@ -0,0 +1,447 @@
+*****************************************************
+Support for AArch64 Scalable Matrix Extension in LLVM
+*****************************************************
+
+.. contents::
+ :local:
+
+1. Introduction
+===============
+
+The :ref:`AArch64 SME ACLE <aarch64_sme_acle>` provides a number of
+attributes for users to control PSTATE.SM and PSTATE.ZA.
+The :ref:`AArch64 SME ABI<aarch64_sme_abi>` describes the requirements for
+calls between functions when at least one of those functions uses PSTATE.SM or
+PSTATE.ZA.
+
+This document describes how the SME ACLE attributes map to LLVM IR
+attributes and how LLVM lowers these attributes to implement the rules and
+requirements of the ABI.
+
+Below we describe the LLVM IR attributes and their relation to the C/C++
+level ACLE attributes:
+
+``aarch64_pstate_sm_enabled``
+ is used for functions with ``__attribute__((arm_streaming))``
+
+``aarch64_pstate_sm_compatible``
+ is used for functions with ``__attribute__((arm_streaming_compatible))``
+
+``aarch64_pstate_sm_body``
+ is used for functions with ``__attribute__((arm_locally_streaming))`` and is
+ only valid on function definitions (not declarations)
+
+``aarch64_pstate_za_new``
+ is used for functions with ``__attribute__((arm_new_za))``
+
+``aarch64_pstate_za_shared``
+ is used for functions with ``__attribute__((arm_shared_za))``
+
+``aarch64_pstate_za_preserved``
+ is used for functions with ``__attribute__((arm_preserves_za))``
+
+Clang must ensure that the above attributes are added both to the
+function's declaration/definition and to its call-sites. This is
+important for calls through attributed function pointers, where no
+definition or declaration is available.
+
+
+2. Handling PSTATE.SM
+=====================
+
+When changing PSTATE.SM the execution of FP/vector operations may be transferred
+to another processing element. This has three important implications:
+
+* The runtime SVE vector length may change.
+
+* The contents of FP/AdvSIMD/SVE registers are zeroed.
+
+* The set of allowable instructions changes.
+
+This leads to certain restrictions on IR and optimizations. For example, it
+is undefined behaviour to share vector-length dependent state between functions
+that may operate with different values for PSTATE.SM. Front-ends must honour
+these restrictions when generating LLVM IR.
+
+Even though the runtime SVE vector length may change, for the purpose of LLVM IR
+and almost all parts of CodeGen we can assume that the runtime value for
+``vscale`` does not. If we let the compiler insert the appropriate ``smstart``
+and ``smstop`` instructions around call boundaries, then the effects on SVE
+state can be mitigated. By limiting the state changes to a very brief window
+around the call we can control how the operations are scheduled and how live
+values remain preserved between state transitions.
+
+In order to control PSTATE.SM at this level of granularity, we use function and
+callsite attributes rather than intrinsics.
+
+
+Restrictions on attributes
+--------------------------
+
+* It is undefined behaviour to pass or return (pointers to) scalable vector
+ objects to/from functions which may use a different SVE vector length.
+ This includes functions with a non-streaming interface, but marked with
+ ``aarch64_pstate_sm_body``.
+
+* It is not allowed for a function to be decorated with both
+ ``aarch64_pstate_sm_compatible`` and ``aarch64_pstate_sm_enabled``.
+
+* It is not allowed for a function to be decorated with both
+ ``aarch64_pstate_za_new`` and ``aarch64_pstate_za_preserved``.
+
+* It is not allowed for a function to be decorated with both
+ ``aarch64_pstate_za_new`` and ``aarch64_pstate_za_shared``.
+
+These restrictions also apply in the higher level SME ACLE, which means we can
+emit diagnostics in Clang to signal users about incorrect behaviour.
+
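The three mutual-exclusion rules above can be sketched as a small validity check. This is only an illustration: the flag names and the ``isValidCombination`` helper are hypothetical and are not part of LLVM or Clang. The first restriction (passing scalable vectors between functions with different vector lengths) is not modelled here.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical bit flags mirroring the IR attribute names; not an LLVM API.
enum SMEAttr : std::uint8_t {
  SM_Enabled    = 1 << 0, // aarch64_pstate_sm_enabled
  SM_Compatible = 1 << 1, // aarch64_pstate_sm_compatible
  SM_Body       = 1 << 2, // aarch64_pstate_sm_body
  ZA_New        = 1 << 3, // aarch64_pstate_za_new
  ZA_Shared     = 1 << 4, // aarch64_pstate_za_shared
  ZA_Preserved  = 1 << 5, // aarch64_pstate_za_preserved
};

// True if both flags a and b are present in attrs.
constexpr bool hasBoth(unsigned attrs, unsigned a, unsigned b) {
  return (attrs & a) != 0 && (attrs & b) != 0;
}

// Rejects exactly the three attribute pairs that the list above forbids.
constexpr bool isValidCombination(unsigned attrs) {
  return !hasBoth(attrs, SM_Compatible, SM_Enabled) &&
         !hasBoth(attrs, ZA_New, ZA_Preserved) &&
         !hasBoth(attrs, ZA_New, ZA_Shared);
}

static_assert(isValidCombination(SM_Enabled | ZA_Shared | ZA_Preserved),
              "streaming + shared/preserved ZA is not excluded by the list");
static_assert(!isValidCombination(SM_Enabled | SM_Compatible),
              "sm_enabled and sm_compatible are mutually exclusive");
static_assert(!isValidCombination(ZA_New | ZA_Shared),
              "za_new and za_shared are mutually exclusive");
```

A front-end diagnostic could be driven by a check of this shape, although the real diagnostics in Clang may be structured differently.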
+
+Compiler inserted streaming-mode changes
+----------------------------------------
+
+The table below describes the transitions in PSTATE.SM the compiler has to
+account for when doing calls between functions with different attributes.
+In this table, we use the following abbreviations:
+
+``N``
+ functions with a normal interface (PSTATE.SM=0 on entry, PSTATE.SM=0 on
+ return)
+
+``S``
+ functions with a Streaming interface (PSTATE.SM=1 on entry, PSTATE.SM=1
+ on return)
+
+``SC``
+ functions with a Streaming-Compatible interface (PSTATE.SM can be
+ either 0 or 1 on entry, and is unchanged on return).
+
+Functions with ``__attribute__((arm_locally_streaming))`` are excluded from this
+table because for the caller the attribute is synonymous with 'streaming', and
+for the callee it is merely an implementation detail that is explicitly not
+exposed to the caller.
+
+.. table:: Combinations of calls for functions with different attributes
+
+   ==== ==== =============================== ============================== ==============================
+   From To   Before call                     After call                     After exception
+   ==== ==== =============================== ============================== ==============================
+   N    N
+   N    S    SMSTART                         SMSTOP
+   N    SC
+   S    N    SMSTOP                          SMSTART                        SMSTART
+   S    S                                                                   SMSTART
+   S    SC                                                                  SMSTART
+   SC   N    If PSTATE.SM before call is 1,  If PSTATE.SM before call is 1, If PSTATE.SM before call is 1,
+             then SMSTOP                     then SMSTART                   then SMSTART
+   SC   S    If PSTATE.SM before call is 0,  If PSTATE.SM before call is 0, If PSTATE.SM before call is 1,
+             then SMSTART                    then SMSTOP                    then SMSTART
+   SC   SC                                                                  If PSTATE.SM before call is 1,
+                                                                            then SMSTART
+   ==== ==== =============================== ============================== ==============================
+
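The table above can be read as a function of two inputs: the value of PSTATE.SM in the caller just before the call, and the callee's interface. The sketch below models that reading. All names (``Interface``, ``Op``, ``transitionsFor``) are hypothetical and chosen for illustration; this is not how LLVM represents the transitions internally. The "After exception" column is modelled as "re-enter streaming mode whenever the caller was streaming before the call", which is an interpretation of the table rather than a statement taken from the ABI.

```cpp
#include <cassert>

// Hypothetical names for illustration only; these are not LLVM types.
enum class Interface { N, S, SC };
enum class Op { None, SMStart, SMStop };

struct Transitions {
  Op beforeCall;
  Op afterCall;
  Op afterException;
};

// PSTATE.SM value the callee runs with on entry, given the caller's value.
constexpr bool calleeEntryState(Interface callee, bool callerSM) {
  return callee == Interface::S   ? true
         : callee == Interface::N ? false
                                  : callerSM; // SC: keeps the caller's mode
}

// callerSM is PSTATE.SM just before the call: always false in an N caller,
// always true in an S caller, and only known at run time in an SC caller.
constexpr Transitions transitionsFor(bool callerSM, Interface callee) {
  const bool calleeSM = calleeEntryState(callee, callerSM);
  Transitions t{Op::None, Op::None, Op::None};
  if (callerSM != calleeSM) {
    t.beforeCall = calleeSM ? Op::SMStart : Op::SMStop;
    t.afterCall = callerSM ? Op::SMStart : Op::SMStop;
  }
  // Last column of the table: SMSTART iff the caller was streaming.
  if (callerSM)
    t.afterException = Op::SMStart;
  return t;
}

// Spot checks against rows of the table.
static_assert(transitionsFor(false, Interface::S).beforeCall == Op::SMStart,
              "N -> S: SMSTART before the call");
static_assert(transitionsFor(false, Interface::S).afterCall == Op::SMStop,
              "N -> S: SMSTOP after the call");
static_assert(transitionsFor(true, Interface::N).afterException == Op::SMStart,
              "S -> N: SMSTART after an exception");
static_assert(transitionsFor(true, Interface::SC).beforeCall == Op::None,
              "S -> SC: no mode change before the call");
```

For an SC caller the first argument is not a compile-time constant, which is why those rows of the table are phrased as conditions on the run-time value of PSTATE.SM.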
+
+Because changing PSTATE.SM zeroes the FP/vector registers, it is best to emit
+the ``smstart`` and ``smstop`` instructions before register allocation, so that
+the register allocator can spill/reload registers around the mode change.
+
+The compiler should also have sufficient information on which operations are
+part of the call/function's arguments/result and which operations are part of
+the function's body, so that it can place the mode changes in exactly the right
+position. The suitable place to do this seems to be SelectionDAG, where it lowers
+the call's arguments/return values to implement the specified calling convention.
+SelectionDAG provides Chains and Glue to specify the order of operations and give
+preliminary control over the instruction's scheduling.
+
+
+Example of preserving state
+---------------------------
+
+When passing and returning a ``float`` value to/from a function
+that has a streaming interface from a function that has a normal interface, the
+call-site will need to ensure that the argument/result registers are preserved
+and that no other code is scheduled in between the ``smstart/smstop`` and the call.
+
+.. code-block:: llvm
+
+ define float @foo(float %f) nounwind {
+ %res = call float @bar(float %f) "aarch64_pstate_sm_enabled"
+ ret float %res
+ }
+
+ declare float @bar(float) "aarch64_pstate_sm_enabled"
+
+The program needs to preserve the value of the floating point argument and
+return value in register ``s0``:
+
+.. code-block:: none
+
+ foo: // @foo
+ // %bb.0:
+ stp d15, d14, [sp, #-80]! // 16-byte Folded Spill
+ stp d13, d12, [sp, #16] // 16-byte Folded Spill
+ stp d11, d10, [sp, #32] // 16-byte Folded Spill
+ stp d9, d8, [sp, #48] // 16-byte Folded Spill
+ str x30, [sp, #64] // 8-byte Folded Spill
+ str s0, [sp, #76] // 4-byte Folded Spill
+ smstart sm
+ ldr s0, [sp, #76] // 4-byte Folded Reload
+ bl bar
+ str s0, [sp, #76] // 4-byte Folded Spill
+ smstop sm
+ ldp d9, d8, [sp, #48] // 16-byte Folded Reload
+ ldp d11, d10, [sp, #32] // 16-byte Folded Reload
+ ldp d13, d12, [sp, #16] // 16-byte Folded Reload
+ ldr s0, [sp, #76] // 4-byte Folded Reload
+ ldr x30, [sp, #64] // 8-byte Folded Reload
+ ldp d15, d14, [sp], #80 // 16-byte Folded Reload
+ ret
+
+Setting the correct register masks on the ISD nodes and inserting the
+``smstart/smstop`` in the right places should ensure this is done correctly.
+
+
+Instruction Selection Nodes
+---------------------------
+
+.. code-block:: none
+
+ AArch64ISD::SMSTART Chain, [SM|ZA|Both], CurrentState, ExpectedState[, RegMask]
+ AArch64ISD::SMSTOP Chain, [SM|ZA|Both], CurrentState, ExpectedState[, RegMask]
+
+The ``SMSTART/SMSTOP`` nodes take ``CurrentState`` and ``ExpectedState`` operands
+to support a conditional SMSTART/SMSTOP. The instruction will only be executed
+if ``CurrentState != ExpectedState``.
+
+When ``CurrentState`` and ``ExpectedState`` can be evaluated at compile-time
+(i.e. they are both constants) then an unconditional ``smstart/smstop``
+instruction is emitted. Otherwise the node is matched to a Pseudo instruction
+which expands to a compare/branch and a ``smstart/smstop``. This is necessary to
+implement transitions from ``SC -> N`` and ``SC -> S``.
+
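The selection rule described above can be sketched as follows. The ``StateOperand`` type and both helper functions are hypothetical and exist only for this illustration; the sketch also assumes that a node whose two operands are equal constants is never created, since the text does not say what happens in that case.

```cpp
#include <cassert>

// Hypothetical model of a CurrentState/ExpectedState operand; not an LLVM type.
struct StateOperand {
  bool isConstant; // true if the value is known at compile time
  bool value;      // the PSTATE.SM value (only meaningful if isConstant)
};

enum class Selected { UnconditionalInstruction, ConditionalPseudo };

// Both operands constant: emit a plain smstart/smstop.
// Otherwise: emit the pseudo that expands to a compare/branch + smstart/smstop.
constexpr Selected selectNode(StateOperand current, StateOperand expected) {
  return (current.isConstant && expected.isConstant)
             ? Selected::UnconditionalInstruction
             : Selected::ConditionalPseudo;
}

// Run-time effect of the expanded compare/branch: the mode change is only
// executed when the two states differ.
constexpr bool modeChangeExecutes(bool currentState, bool expectedState) {
  return currentState != expectedState;
}

// S -> N: both states are known, so the SMSTOP is unconditional.
static_assert(selectNode({true, true}, {true, false}) ==
                  Selected::UnconditionalInstruction,
              "constant operands select the plain instruction");
// SC -> N: the current state is only known at run time.
static_assert(selectNode({false, false}, {true, false}) ==
                  Selected::ConditionalPseudo,
              "a run-time operand selects the pseudo");
static_assert(modeChangeExecutes(true, false) && !modeChangeExecutes(false, false),
              "SC -> N executes the SMSTOP only when PSTATE.SM was 1");
```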
+
+Unchained Function calls
+------------------------
+When a function with "``aarch64_pstate_sm_enabled``" calls a function that is not
+streaming compatible, the compiler has to insert an SMSTOP before the call and
+insert an SMSTART after the call.
+
+If the function that is called is an intrinsic with no side-effects which in
+turn is lowered to a function call (e.g. ``@llvm.cos()``), then the call to
+``@llvm.cos()`` is not part of any Chain; it can be scheduled freely.
+
+Lowering of a Callsite creates a small chain of nodes which:
+
+- starts a call sequence
+
+- copies input values from virtual registers to physical registers specified by
+ the ABI
+
+- executes a branch-and-link
+
+- stops the call sequence
+
+- copies the output values from their physical registers to virtual registers
+
+When the callsite's Chain is not used, only the result value from the chained
+sequence is used, but the Chain itself is discarded.
+
+The ``SMSTART`` and ``SMSTOP`` ISD nodes return a Chain, but no real
+values, so when the ``SMSTART/SMSTOP`` nodes are part of a Chain that isn't
+used, these nodes are not considered for scheduling and are
+removed from the DAG. In order to prevent these nodes
+from being removed, we need a way to ensure the results from the
+``CopyFromReg`` can only be **used after** the ``SMSTART/SMSTOP`` has been
+executed.
+
+We can use a CopyToReg -> CopyFromReg sequence for this, which moves the
+value to/from a virtual register and chains these nodes with the
+SMSTART/SMSTOP to make them part of the expression that calculates
+the result value. The resulting COPY nodes are removed by the register
+allocator.
+
+The example below shows how this is used in a DAG that does not link
+together the result by a Chain, but rather by a value:
+
+.. code-block:: none
+
+ t0: ch,glue = AArch64ISD::SMSTOP ...
+ t1: ch,glue = ISD::CALL ....
+ t2: res,ch,glue = CopyFromReg t1, ...
+ t3: ch,glue = AArch64ISD::SMSTART t2:1, .... <- this is now part of the expression that returns the result value.
+ t4: ch = CopyToReg t3, Register:f64 %vreg, t2
+ t5: res,ch = CopyFromReg t4, Register:f64 %vreg
+ t6: res = FADD t5, t9
+
+We also need this for locally streaming functions, where an ``SMSTART`` needs to
+be inserted into the DAG at the start of the function.
+
+Functions with __attribute__((arm_locally_streaming))
+-----------------------------------------------------
+
+If a function is marked as ``arm_locally_streaming``, then the runtime SVE
+vector length in the prologue/epilogue may be different from the vector length
+in the function's body. This happens because we invoke smstart after setting up
+the stack-frame and similarly invoke smstop before deallocating the stack-frame.
+
+To ensure we use the correct SVE vector length to allocate the locals with, we
+can use the streaming vector-length to allocate the stack-slots through the
+``ADDSVL`` instruction, even when the CPU is not yet in streaming mode.
+
+This only works for locals and not callee-save slots, since LLVM doesn't support
+mixing two different scalable vector lengths in one stack frame. That means that the
+case where a function is marked ``arm_locally_streaming`` and needs to spill SVE
+callee-saves in the prologue is currently unsupported. However, it is unlikely
+for this to happen without user intervention, because ``arm_locally_streaming``
+functions cannot take or return vector-length-dependent values. Encountering
+this problem would require forcing the SVE PCS with '``aarch64_sve_pcs``' on a
+function that is also marked ``arm_locally_streaming``. Clang can prevent this
+combination by emitting a diagnostic.
+
+
+An example of how the prologue/epilogue would look for a function that is
+attributed with ``arm_locally_streaming``:
+
+.. code-block:: c++
+
+ #define N 64
+
+ void __attribute__((arm_streaming_compatible)) some_use(svfloat32_t *);
+
+ // Use a float argument type, to check the value isn't clobbered by smstart.
+ // Use a float return type to check the value isn't clobbered by smstop.
+ float __attribute__((noinline, arm_locally_streaming)) foo(float arg) {
+ // Create local for SVE vector to check local is created with correct
+ // size when not yet in streaming mode (ADDSVL).
+ float array[N];
+ svfloat32_t vector;
+
+ some_use(&vector);
+ svst1_f32(svptrue_b32(), &array[0], vector);
+ return array[N - 1] + arg;
+ }
+
+The function above should use ADDSVL for allocating the stack space and should
+avoid clobbering the return/argument values.
+
+.. code-block:: none
+
+ _Z3foof: // @_Z3foof
+ // %bb.0: // %entry
+ stp d15, d14, [sp, #-96]! // 16-byte Folded Spill
+ stp d13, d12, [sp, #16] // 16-byte Folded Spill
+ stp d11, d10, [sp, #32] // 16-byte Folded Spill
+ stp d9, d8, [sp, #48] // 16-byte Folded Spill
+ stp x29, x30, [sp, #64] // 16-byte Folded Spill
+ add x29, sp, #64
+ str x28, [sp, #80] // 8-byte Folded Spill
+ addsvl sp, sp, #-1
+ sub sp, sp, #256
+ str s0, [x29, #28] // 4-byte Folded Spill
+ smstart sm
+ sub x0, x29, #64
+ addsvl x0, x0, #-1
+ bl _Z10some_usePu13__SVFloat32_t
+ sub x8, x29, #64
+ ptrue p0.s
+ ld1w { z0.s }, p0/z, [x8, #-1, mul vl]
+ ldr s1, [x29, #28] // 4-byte Folded Reload
+ st1w { z0.s }, p0, [sp]
+ ldr s0, [sp, #252]
+ fadd s0, s0, s1
+ str s0, [x29, #28] // 4-byte Folded Spill
+ smstop sm
+ ldr s0, [x29, #28] // 4-byte Folded Reload
+ addsvl sp, sp, #1
+ add sp, sp, #256
+ ldp x29, x30, [sp, #64] // 16-byte Folded Reload
+ ldp d9, d8, [sp, #48] // 16-byte Folded Reload
+ ldp d11, d10, [sp, #32] // 16-byte Folded Reload
+ ldp d13, d12, [sp, #16] // 16-byte Folded Reload
+ ldr x28, [sp, #80] // 8-byte Folded Reload
+ ldp d15, d14, [sp], #96 // 16-byte Folded Reload
+ ret
+
+
+Preventing the use of illegal instructions in Streaming Mode
+------------------------------------------------------------
+
+* When executing a program in streaming-mode (PSTATE.SM=1) a subset of SVE/SVE2
+ instructions and most AdvSIMD/NEON instructions are invalid.
+
+* When executing a program in normal mode (PSTATE.SM=0), a subset of SME
+ instructions are invalid.
+
+* Streaming-compatible functions must only use instructions that are valid when
+ either PSTATE.SM=0 or PSTATE.SM=1.
+
+The value of PSTATE.SM is not controlled by the feature flags, but rather by the
+function attributes. This means that when compiling for '``+sme``' the compiler
+may generate any instruction, even one that is not legal under the requested
+streaming mode. The compiler therefore needs to use the function attributes to
+avoid transformations that assume certain operations are available at runtime.
+
+We made a conscious choice not to model this with feature flags, because we
+still want to support inline-asm in either mode (with the user placing
+smstart/smstop manually), and this became rather complicated to implement at the
+individual instruction level (see `D120261 <https://reviews.llvm.org/D120261>`_
+and `D121208 <https://reviews.llvm.org/D121208>`_) because of limitations in
+TableGen.
+
+As a first step, this means we'll disable vectorization (LoopVectorize/SLP)
+entirely when a function has any of the ``aarch64_pstate_sm_enabled``,
+``aarch64_pstate_sm_body`` or ``aarch64_pstate_sm_compatible`` attributes,
+in order to avoid the use of vector instructions.
+
+Later on we'll aim to relax these restrictions to enable scalable
+auto-vectorization with a subset of streaming-compatible instructions, but that
+requires changes to the CostModel, Legalization and SelectionDAG lowering.
+
+We will also emit diagnostics in Clang to prevent the use of
+non-streaming(-compatible) operations, e.g. through ACLE intrinsics, when a
+function is decorated with the streaming mode attributes.
+
+
+Other things to consider
+------------------------
+
+* Inlining must be disabled when the call-site needs to toggle PSTATE.SM or
+ when the callee's function body is executed in a different streaming mode than
+ its caller. This is needed because function calls are the boundaries for
+ streaming mode changes.
+
+* Tail call optimization must be disabled when the call-site needs to toggle
+ PSTATE.SM, such that the caller can restore the original value of PSTATE.SM.
+
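The two rules above can be sketched as predicates over the streaming-mode attributes of caller and callee. This is one possible reading, not LLVM's implementation: the ``SMAttrs`` struct, the three-valued ``Mode`` and all function names are hypothetical, and a streaming-compatible body is treated as running in whatever mode its caller's body runs in.

```cpp
#include <cassert>

// Hypothetical model for illustration; not LLVM's representation.
enum class Mode { NonStreaming, Streaming, Either };

struct SMAttrs {
  bool enabled;    // aarch64_pstate_sm_enabled
  bool compatible; // aarch64_pstate_sm_compatible
  bool body;       // aarch64_pstate_sm_body (locally streaming)
};

// Mode required on entry to the function (its interface).
constexpr Mode interfaceMode(SMAttrs a) {
  return a.enabled ? Mode::Streaming
                   : a.compatible ? Mode::Either : Mode::NonStreaming;
}

// Mode in which the function body executes.
constexpr Mode bodyMode(SMAttrs a) {
  return (a.enabled || a.body) ? Mode::Streaming : interfaceMode(a);
}

// The call-site may need an smstart/smstop if the callee demands a specific
// mode that is not guaranteed to match the mode of the caller's body.
constexpr bool callMayToggleSM(SMAttrs caller, SMAttrs callee) {
  return interfaceMode(callee) != Mode::Either &&
         bodyMode(caller) != interfaceMode(callee);
}

constexpr bool mayInline(SMAttrs caller, SMAttrs callee) {
  return !callMayToggleSM(caller, callee) &&
         (bodyMode(callee) == Mode::Either ||
          bodyMode(callee) == bodyMode(caller));
}

constexpr bool mayTailCall(SMAttrs caller, SMAttrs callee) {
  return !callMayToggleSM(caller, callee);
}

// Streaming caller, normal callee: the call-site toggles PSTATE.SM.
static_assert(!mayInline({true, false, false}, {false, false, false}) &&
                  !mayTailCall({true, false, false}, {false, false, false}),
              "S -> N blocks both inlining and tail calls");
// Normal caller, locally-streaming callee: no toggle at the call-site, but
// the callee's body runs in a different mode.
static_assert(!mayInline({false, false, false}, {false, false, true}) &&
                  mayTailCall({false, false, false}, {false, false, true}),
              "N -> locally streaming blocks inlining only");
```

In this model a streaming caller may still inline a streaming-compatible callee, because the callee's body simply continues in streaming mode.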
+
+3. Handling PSTATE.ZA
+=====================
+
+In contrast to PSTATE.SM, enabling PSTATE.ZA does not affect the SVE vector
+length and also doesn't clobber FP/AdvSIMD/SVE registers. This means it is safe
+to toggle PSTATE.ZA using intrinsics. This also makes it simpler to set up a
+lazy-save mechanism for calls to private-ZA functions (i.e. functions that may
+either directly or indirectly clobber ZA state).
+
+For this purpose, we'll introduce a new LLVM IR pass that is run just before
+SelectionDAG.
+
+Setting up a lazy-save
+----------------------
+
+Committing a lazy-save
+----------------------
+
+Exception handling and ZA
+-------------------------
+
+4. References
+=============
+
+ .. _aarch64_sme_acle:
+
+1. `SME ACLE Pull-request <https://github.com/ARM-software/acle/pull/188>`__
+
+ .. _aarch64_sme_abi:
+
+2. `SME ABI Pull-request <https://github.com/ARM-software/abi-aa/pull/123>`__
diff --git a/llvm/docs/UserGuides.rst b/llvm/docs/UserGuides.rst
index 3e5252d81315f..517a23e959206 100644
--- a/llvm/docs/UserGuides.rst
+++ b/llvm/docs/UserGuides.rst
@@ -12,6 +12,7 @@ intermediate LLVM representation.
.. toctree::
:hidden:
+ AArch64SME
AddingConstrainedIntrinsics
AdvancedBuilds
AliasAnalysis
@@ -229,6 +230,9 @@ Additional Topics
LLVM's support for generating NEON instructions on big endian ARM targets is
somewhat nonintuitive. This document explains the implementation and rationale.
+:doc:`AArch64SME`
+ LLVM's support for AArch64 SME ACLE and ABI.
+
:doc:`CompileCudaWithLLVM`
LLVM support for CUDA.