[llvm] 4fc2c92 - [AArch64][SME] Document SME ABI implementation in LLVM

Fri Sep 16 07:48:52 PDT 2022

Author: Sander de Smalen
Date: 2022-09-16T14:48:37Z
New Revision: 4fc2c922fe12a6a4aa5de6cf24a2cd2cd5a4584a

URL: https://github.com/llvm/llvm-project/commit/4fc2c922fe12a6a4aa5de6cf24a2cd2cd5a4584a
DIFF: https://github.com/llvm/llvm-project/commit/4fc2c922fe12a6a4aa5de6cf24a2cd2cd5a4584a.diff

LOG: [AArch64][SME] Document SME ABI implementation in LLVM

Adds a design document for implementing the SME ABI in LLVM. This document
can be used as a reference for follow-up patches that attempt to implement
the ABI.

Reviewed By: efriedma

Differential Revision: https://reviews.llvm.org/D131562

Added: 
    llvm/docs/AArch64SME.rst

Modified: 
    llvm/docs/UserGuides.rst

Removed: 
    


################################################################################
diff  --git a/llvm/docs/AArch64SME.rst b/llvm/docs/AArch64SME.rst
new file mode 100644
index 0000000000000..4585bb96664f9

--- /dev/null
+++ b/llvm/docs/AArch64SME.rst
@@ -0,0 +1,447 @@
+*****************************************************
+Support for AArch64 Scalable Matrix Extension in LLVM
+*****************************************************
+
+.. contents::
+   :local:
+
+1. Introduction
+===============
+
+The :ref:`AArch64 SME ACLE <aarch64_sme_acle>` provides a number of
+attributes for users to control PSTATE.SM and PSTATE.ZA.
+The :ref:`AArch64 SME ABI<aarch64_sme_abi>` describes the requirements for
+calls between functions when at least one of those functions uses PSTATE.SM or
+PSTATE.ZA.
+
+This document describes how the SME ACLE attributes map to LLVM IR
+attributes and how LLVM lowers these attributes to implement the rules and
+requirements of the ABI.
+
+Below we describe the LLVM IR attributes and their relation to the C/C++
+level ACLE attributes:
+
+``aarch64_pstate_sm_enabled``
+    is used for functions with ``__attribute__((arm_streaming))``
+
+``aarch64_pstate_sm_compatible``
+    is used for functions with ``__attribute__((arm_streaming_compatible))``
+
+``aarch64_pstate_sm_body``
+  is used for functions with ``__attribute__((arm_locally_streaming))`` and is
+  only valid on function definitions (not declarations)
+
+``aarch64_pstate_za_new``
+  is used for functions with ``__attribute__((arm_new_za))``
+
+``aarch64_pstate_za_shared``
+  is used for functions with ``__attribute__((arm_shared_za))``
+
+``aarch64_pstate_za_preserved``
+  is used for functions with ``__attribute__((arm_preserves_za))``
+
+Clang must ensure that the above attributes are added both to the
+function's declaration/definition as well as to their call-sites. This is
+important for calls to attributed function pointers, where there is no
+definition or declaration available.
+
+
+2. Handling PSTATE.SM
+=====================
+
+When changing PSTATE.SM the execution of FP/vector operations may be transferred
+to another processing element. This has three important implications:
+
+* The runtime SVE vector length may change.
+
+* The contents of FP/AdvSIMD/SVE registers are zeroed.
+
+* The set of allowable instructions changes.
+
+This leads to certain restrictions on IR and optimizations. For example, it
+is undefined behaviour to share vector-length dependent state between functions
+that may operate with 
diff erent values for PSTATE.SM. Front-ends must honour
+these restrictions when generating LLVM IR.
+
+Even though the runtime SVE vector length may change, for the purpose of LLVM IR
+and almost all parts of CodeGen we can assume that the runtime value for
+``vscale`` does not. If we let the compiler insert the appropriate ``smstart``
+and ``smstop`` instructions around call boundaries, then the effects on SVE
+state can be mitigated. By limiting the state changes to a very brief window
+around the call we can control how the operations are scheduled and how live
+values remain preserved between state transitions.
+
+In order to control PSTATE.SM at this level of granularity, we use function and
+callsite attributes rather than intrinsics.
+
+
+Restrictions on attributes
+--------------------------
+
+* It is undefined behaviour to pass or return (pointers to) scalable vector
+  objects to/from functions which may use a 
diff erent SVE vector length.
+  This includes functions with a non-streaming interface, but marked with
+  ``aarch64_pstate_sm_body``.
+
+* It is not allowed for a function to be decorated with both
+  ``aarch64_pstate_sm_compatible`` and ``aarch64_pstate_sm_enabled``.
+
+* It is not allowed for a function to be decorated with both
+  ``aarch64_pstate_za_new`` and ``aarch64_pstate_za_preserved``.
+
+* It is not allowed for a function to be decorated with both
+  ``aarch64_pstate_za_new`` and ``aarch64_pstate_za_shared``.
+
+These restrictions also apply in the higher level SME ACLE, which means we can
+emit diagnostics in Clang to signal users about incorrect behaviour.
+
+
+Compiler inserted streaming-mode changes
+----------------------------------------
+
+The table below describes the transitions in PSTATE.SM the compiler has to
+account for when doing calls between functions with 
diff erent attributes.
+In this table, we use the following abbreviations:
+
+``N``
+  functions with a normal interface (PSTATE.SM=0 on entry, PSTATE.SM=0 on
+  return)
+
+``S``
+  functions with a Streaming interface (PSTATE.SM=1 on entry, PSTATE.SM=1
+  on return)
+
+``SC``
+  functions with a Streaming-Compatible interface (PSTATE.SM can be
+  either 0 or 1 on entry, and is unchanged on return).
+
+Functions with ``__attribute__((arm_locally_streaming))`` are excluded from this
+table because for the caller the attribute is synonymous to 'streaming', and
+for the callee it is merely an implementation detail that is explicitly not
+exposed to the caller.
+
+.. table:: Combinations of calls for functions with 
diff erent attributes
+
+   ==== ==== =============================== ============================== ==============================
+   From To   Before call                     After call                     After exception
+   ==== ==== =============================== ============================== ==============================
+   N    N
+   N    S    SMSTART                         SMSTOP
+   N    SC
+   S    N    SMSTOP                          SMSTART                        SMSTART
+   S    S                                                                   SMSTART
+   S    SC                                                                  SMSTART
+   SC   N    If PSTATE.SM before call is 1,  If PSTATE.SM before call is 1, If PSTATE.SM before call is 1,
+             then SMSTOP                     then SMSTART                   then SMSTART
+   SC   S    If PSTATE.SM before call is 0,  If PSTATE.SM before call is 0, If PSTATE.SM before call is 1,
+             then SMSTART                    then SMSTOP                    then SMSTART
+   SC   SC                                                                  If PSTATE.SM before call is 1,
+                                                                            then SMSTART
+   ==== ==== =============================== ============================== ==============================
+
+
+Because changing PSTATE.SM zeroes the FP/vector registers, it is best to emit
+the ``smstart`` and ``smstop`` instructions before register allocation, so that
+the register allocator can spill/reload registers around the mode change.
+
+The compiler should also have sufficient information on which operations are
+part of the call/function's arguments/result and which operations are part of
+the function's body, so that it can place the mode changes in exactly the right
+position. The suitable place to do this seems to be SelectionDAG, where it lowers
+the call's arguments/return values to implement the specified calling convention.
+SelectionDAG provides Chains and Glue to specify the order of operations and give
+preliminary control over the instruction's scheduling.
+
+
+Example of preserving state
+---------------------------
+
+When passing and returning a ``float`` value to/from a function
+that has a streaming interface from a function that has a normal interface, the
+call-site will need to ensure that the argument/result registers are preserved
+and that no other code is scheduled in between the ``smstart/smstop`` and the call.
+
+.. code-block:: llvm
+
+    define float @foo(float %f) nounwind {
+      %res = call float @bar(float %f) "aarch64_pstate_sm_enabled"
+      ret float %res
+    }
+
+    declare float @bar(float) "aarch64_pstate_sm_enabled"
+
+The program needs to preserve the value of the floating point argument and
+return value in register ``s0``:
+
+.. code-block:: none
+
+    foo:                                    // @foo
+    // %bb.0:
+            stp     d15, d14, [sp, #-80]!           // 16-byte Folded Spill
+            stp     d13, d12, [sp, #16]             // 16-byte Folded Spill
+            stp     d11, d10, [sp, #32]             // 16-byte Folded Spill
+            stp     d9, d8, [sp, #48]               // 16-byte Folded Spill
+            str     x30, [sp, #64]                  // 8-byte Folded Spill
+            str     s0, [sp, #76]                   // 4-byte Folded Spill
+            smstart sm
+            ldr     s0, [sp, #76]                   // 4-byte Folded Reload
+            bl      bar
+            str     s0, [sp, #76]                   // 4-byte Folded Spill
+            smstop  sm
+            ldp     d9, d8, [sp, #48]               // 16-byte Folded Reload
+            ldp     d11, d10, [sp, #32]             // 16-byte Folded Reload
+            ldp     d13, d12, [sp, #16]             // 16-byte Folded Reload
+            ldr     s0, [sp, #76]                   // 4-byte Folded Reload
+            ldr     x30, [sp, #64]                  // 8-byte Folded Reload
+            ldp     d15, d14, [sp], #80             // 16-byte Folded Reload
+            ret
+
+Setting the correct register masks on the ISD nodes and inserting the
+``smstart/smstop`` in the right places should ensure this is done correctly.
+
+
+Instruction Selection Nodes
+---------------------------
+
+.. code-block:: none
+
+  AArch64ISD::SMSTART Chain, [SM|ZA|Both], CurrentState, ExpectedState[, RegMask]
+  AArch64ISD::SMSTOP  Chain, [SM|ZA|Both], CurrentState, ExpectedState[, RegMask]
+
+The ``SMSTART/SMSTOP`` nodes take ``CurrentState`` and ``ExpectedState`` operand for
+the case of a conditional SMSTART/SMSTOP. The instruction will only be executed
+if CurrentState != ExpectedState.
+
+When ``CurrentState`` and ``ExpectedState`` can be evaluated at compile-time
+(i.e. they are both constants) then an unconditional ``smstart/smstop``
+instruction is emitted. Otherwise the node is matched to a Pseudo instruction
+which expands to a compare/branch and a ``smstart/smstop``. This is necessary to
+implement transitions from ``SC -> N`` and ``SC -> S``.
+
+
+Unchained Function calls
+------------------------
+When a function with "``aarch64_pstate_sm_enabled``" calls a function that is not
+streaming compatible, the compiler has to insert a SMSTOP before the call and
+insert a SMSTOP after the call.
+
+If the function that is called is an intrinsic with no side-effects which in
+turn is lowered to a function call (e.g. ``@llvm.cos()``), then the call to
+``@llvm.cos()`` is not part of any Chain; it can be scheduled freely.
+
+Lowering of a Callsite creates a small chain of nodes which:
+
+- starts a call sequence
+
+- copies input values from virtual registers to physical registers specified by
+  the ABI
+
+- executes a branch-and-link
+
+- stops the call sequence
+
+- copies the output values from their physical registers to virtual registers
+
+When the callsite's Chain is not used, only the result value from the chained
+sequence is used, but the Chain itself is discarded.
+
+The ``SMSTART`` and ``SMSTOP`` ISD nodes return a Chain, but no real
+values, so when the ``SMSTART/SMSTOP`` nodes are part of a Chain that isn't
+used, these nodes are not considered for scheduling and are
+removed from the DAG.  In order to prevent these nodes
+from being removed, we need a way to ensure the results from the
+``CopyFromReg`` can only be **used after** the ``SMSTART/SMSTOP`` has been
+executed.
+
+We can use a CopyToReg -> CopyFromReg sequence for this, which moves the
+value to/from a virtual register and chains these nodes with the
+SMSTART/SMSTOP to make them part of the expression that calculates
+the result value. The resulting COPY nodes are removed by the register
+allocator.
+
+The example below shows how this is used in a DAG that does not link
+together the result by a Chain, but rather by a value:
+
+.. code-block:: none
+
+               t0: ch,glue = AArch64ISD::SMSTOP ...
+             t1: ch,glue = ISD::CALL ....
+           t2: res,ch,glue = CopyFromReg t1, ...
+         t3: ch,glue = AArch64ISD::SMSTART t2:1, ....   <- this is now part of the expression that returns the result value.
+       t4: ch = CopyToReg t3, Register:f64 %vreg, t2
+     t5: res,ch = CopyFromReg t4, Register:f64 %vreg
+   t6: res = FADD t5, t9
+
+We also need this for locally streaming functions, where an ``SMSTART`` needs to
+be inserted into the DAG at the start of the function.
+
+Functions with __attribute__((arm_locally_streaming))
+-----------------------------------------------------
+
+If a function is marked as ``arm_locally_streaming``, then the runtime SVE
+vector length in the prologue/epilogue may be 
diff erent from the vector length
+in the function's body. This happens because we invoke smstart after setting up
+the stack-frame and similarly invoke smstop before deallocating the stack-frame.
+
+To ensure we use the correct SVE vector length to allocate the locals with, we
+can use the streaming vector-length to allocate the stack-slots through the
+``ADDSVL`` instruction, even when the CPU is not yet in streaming mode.
+
+This only works for locals and not callee-save slots, since LLVM doesn't support
+mixing two 
diff erent scalable vector lengths in one stack frame. That means that the
+case where a function is marked ``arm_locally_streaming`` and needs to spill SVE
+callee-saves in the prologue is currently unsupported.  However, it is unlikely
+for this to happen without user intervention, because ``arm_locally_streaming``
+functions cannot take or return vector-length-dependent values. This would otherwise
+require forcing both the SVE PCS using '``aarch64_sve_pcs``' combined with using
+``arm_locally_streaming`` in order to encounter this problem. This combination
+can be prevented in Clang through emitting a diagnostic.
+
+
+An example of how the prologue/epilogue would look for a function that is
+attributed with ``arm_locally_streaming``:
+
+.. code-block:: c++
+
+    #define N 64
+
+    void __attribute__((arm_streaming_compatible)) some_use(svfloat32_t *);
+
+    // Use a float argument type, to check the value isn't clobbered by smstart.
+    // Use a float return type to check the value isn't clobbered by smstop.
+    float __attribute__((noinline, arm_locally_streaming)) foo(float arg) {
+      // Create local for SVE vector to check local is created with correct
+      // size when not yet in streaming mode (ADDSVL).
+      float array[N];
+      svfloat32_t vector;
+
+      some_use(&vector);
+      svst1_f32(svptrue_b32(), &array[0], vector);
+      return array[N - 1] + arg;
+    }
+
+should use ADDSVL for allocating the stack space and should avoid clobbering
+the return/argument values.
+
+.. code-block:: none
+
+    _Z3foof:                                // @_Z3foof
+    // %bb.0:                               // %entry
+            stp     d15, d14, [sp, #-96]!           // 16-byte Folded Spill
+            stp     d13, d12, [sp, #16]             // 16-byte Folded Spill
+            stp     d11, d10, [sp, #32]             // 16-byte Folded Spill
+            stp     d9, d8, [sp, #48]               // 16-byte Folded Spill
+            stp     x29, x30, [sp, #64]             // 16-byte Folded Spill
+            add     x29, sp, #64
+            str     x28, [sp, #80]                  // 8-byte Folded Spill
+            addsvl  sp, sp, #-1
+            sub     sp, sp, #256
+            str     s0, [x29, #28]                  // 4-byte Folded Spill
+            smstart sm
+            sub     x0, x29, #64
+            addsvl  x0, x0, #-1
+            bl      _Z10some_usePu13__SVFloat32_t
+            sub     x8, x29, #64
+            ptrue   p0.s
+            ld1w    { z0.s }, p0/z, [x8, #-1, mul vl]
+            ldr     s1, [x29, #28]                  // 4-byte Folded Reload
+            st1w    { z0.s }, p0, [sp]
+            ldr     s0, [sp, #252]
+            fadd    s0, s0, s1
+            str     s0, [x29, #28]                  // 4-byte Folded Spill
+            smstop  sm
+            ldr     s0, [x29, #28]                  // 4-byte Folded Reload
+            addsvl  sp, sp, #1
+            add     sp, sp, #256
+            ldp     x29, x30, [sp, #64]             // 16-byte Folded Reload
+            ldp     d9, d8, [sp, #48]               // 16-byte Folded Reload
+            ldp     d11, d10, [sp, #32]             // 16-byte Folded Reload
+            ldp     d13, d12, [sp, #16]             // 16-byte Folded Reload
+            ldr     x28, [sp, #80]                  // 8-byte Folded Reload
+            ldp     d15, d14, [sp], #96             // 16-byte Folded Reload
+            ret
+
+
+Preventing the use of illegal instructions in Streaming Mode
+------------------------------------------------------------
+
+* When executing a program in streaming-mode (PSTATE.SM=1) a subset of SVE/SVE2
+  instructions and most AdvSIMD/NEON instructions are invalid.
+
+* When executing a program in normal mode (PSTATE.SM=0), a subset of SME
+  instructions are invalid.
+
+* Streaming-compatible functions must only use instructions that are valid when
+  either PSTATE.SM=0 or PSTATE.SM=1.
+
+The value of PSTATE.SM is not controlled by the feature flags, but rather by the
+function attributes. This means that we can compile for '``+sme``' and the compiler
+will code-generate any instructions, even if they are not legal under the requested
+streaming mode. The compiler needs to use the function attributes to ensure the
+compiler doesn't do transformations under the assumption that certain operations
+are available at runtime.
+
+We made a conscious choice not to model this with feature flags, because we
+still want to support inline-asm in either mode (with the user placing
+smstart/smstop manually), and this became rather complicated to implement at the
+individual instruction level (see `D120261 <https://reviews.llvm.org/D120261>`_
+and `D121208 <https://reviews.llvm.org/D121208>`_) because of limitations in
+TableGen.
+
+As a first step, this means we'll disable vectorization (LoopVectorize/SLP)
+entirely when the a function has either of the ``aarch64_pstate_sm_enabled``,
+``aarch64_pstate_sm_body`` or ``aarch64_pstate_sm_compatible`` attributes,
+in order to avoid the use of vector instructions.
+
+Later on we'll aim to relax these restrictions to enable scalable
+auto-vectorization with a subset of streaming-compatible instructions, but that
+requires changes to the CostModel, Legalization and SelectionDAG lowering.
+
+We will also emit diagnostics in Clang to prevent the use of
+non-streaming(-compatible) operations, e.g. through ACLE intrinsics, when a
+function is decorated with the streaming mode attributes.
+
+
+Other things to consider
+------------------------
+
+* Inlining must be disabled when the call-site needs to toggle PSTATE.SM or
+  when the callee's function body is executed in a 
diff erent streaming mode than
+  its caller. This is needed because function calls are the boundaries for
+  streaming mode changes.
+
+* Tail call optimization must be disabled when the call-site needs to toggle
+  PSTATE.SM, such that the caller can restore the original value of PSTATE.SM.
+
+
+3. Handling PSTATE.ZA
+=====================
+
+In contrast to PSTATE.SM, enabling PSTATE.ZA does not affect the SVE vector
+length and also doesn't clobber FP/AdvSIMD/SVE registers. This means it is safe
+to toggle PSTATE.ZA using intrinsics. This also makes it simpler to setup a
+lazy-save mechanism for calls to private-ZA functions (i.e. functions that may
+either directly or indirectly clobber ZA state).
+
+For this purpose, we'll introduce a new LLVM IR pass that is run just before
+SelectionDAG.
+
+Setting up a lazy-save
+----------------------
+
+Committing a lazy-save
+----------------------
+
+Exception handling and ZA
+-------------------------
+
+4. References
+=============
+
+    .. _aarch64_sme_acle:
+
+1.  `SME ACLE Pull-request <https://github.com/ARM-software/acle/pull/188>`__
+
+    .. _aarch64_sme_abi:
+
+2.  `SME ABI Pull-request <https://github.com/ARM-software/abi-aa/pull/123>`__

diff  --git a/llvm/docs/UserGuides.rst b/llvm/docs/UserGuides.rst
index 3e5252d81315f..517a23e959206 100644
--- a/llvm/docs/UserGuides.rst
+++ b/llvm/docs/UserGuides.rst
@@ -12,6 +12,7 @@ intermediate LLVM representation.
 .. toctree::
    :hidden:
 
+   AArch64SME
    AddingConstrainedIntrinsics
    AdvancedBuilds
    AliasAnalysis
@@ -229,6 +230,9 @@ Additional Topics
   LLVM's support for generating NEON instructions on big endian ARM targets is
   somewhat nonintuitive. This document explains the implementation and rationale.
 
+:doc:`AArch64SME`
+  LLVM's support for AArch64 SME ACLE and ABI.
+
 :doc:`CompileCudaWithLLVM`
   LLVM support for CUDA.