[PATCH] D90548: [NFC][AMDGPU] Restructure the AMDGPU memory model description

Scott Linder via Phabricator via llvm-commits llvm-commits at lists.llvm.org
Mon Nov 2 10:38:58 PST 2020


scott.linder added a comment.

Some small nits, plus a note about widening the table, which we had already discussed.



================
Comment at: llvm/docs/AMDGPUUsage.rst:4151
+instructions that a single thread must execute. The ``s_waitcnt`` and cache
+mnagement insructions such as ``buffer_wbinvl1_vol`` are defined with respect to
+other memory instructions executed by the same thread. This allows them to be
----------------
management


================
Comment at: llvm/docs/AMDGPUUsage.rst:4151
+instructions that a single thread must execute. The ``s_waitcnt`` and cache
+mnagement insructions such as ``buffer_wbinvl1_vol`` are defined with respect to
+other memory instructions executed by the same thread. This allows them to be
----------------
scott.linder wrote:
> management
instructions


================
Comment at: llvm/docs/AMDGPUUsage.rst:4192
+change during the execution of a kernel dispatch it is not legal to perform
+stores, and atomic memory orderings are not meaningful, and all access are
+treated as non-atomic.
----------------
accesses


================
Comment at: llvm/docs/AMDGPUUsage.rst:4224
+     acquire      - If a load atomic/atomicrmw then no following load/load
+                    atomic/store/ store atomic/atomicrmw/fence instruction can
+                    be moved before the acquire.
----------------
Remove extra space


================
Comment at: llvm/docs/AMDGPUUsage.rst:4229
+     release      - If a store atomic/atomicrmw then no preceding load/load
+                    atomic/store/ store atomic/atomicrmw/fence instruction can
+                    be moved after the release.
----------------
Remove extra space
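
For anyone skimming the table, here is a minimal LLVM IR sketch of the
constraint these two rows describe (the pointer names are mine, not from the
patch):

    ; The acquire load: no following memory operation may be moved before it.
    %v = load atomic i32, i32 addrspace(1)* %p syncscope("agent") acquire, align 4
    store i32 1, i32 addrspace(1)* %q, align 4    ; cannot be hoisted above the acquire

    ; The release store: no preceding memory operation may be moved after it.
    %w = load i32, i32 addrspace(1)* %q, align 4  ; cannot be sunk below the release
    store atomic i32 %w, i32 addrspace(1)* %p syncscope("agent") release, align 4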


================
Comment at: llvm/docs/AMDGPUUsage.rst:4307
 
-For GFX10:
-
-* Each agent has multiple shader arrays (SA).
-* Each SA has multiple work-group processors (WGP).
-* Each WGP has multiple compute units (CU).
-* Each CU has multiple SIMDs that execute wavefronts.
-* The wavefronts for a single work-group are executed in the same
-  WGP. In CU wavefront execution mode the wavefronts may be executed by
-  different SIMDs in the same CU. In WGP wavefront execution mode the
-  wavefronts may be executed by different SIMDs in different CUs in the same
-  WGP.
-* Each WGP has a single LDS memory shared by the wavefronts of the work-groups
-  executing on it.
-* All LDS operations of a WGP are performed as wavefront wide operations in a
-  global order and involve no caching. Completion is reported to a wavefront in
-  execution order.
-* The LDS memory has multiple request queues shared by the SIMDs of a
-  WGP. Therefore, the LDS operations performed by different wavefronts of a
-  work-group can be reordered relative to each other, which can result in
-  reordering the visibility of vector memory operations with respect to LDS
-  operations of other wavefronts in the same work-group. A ``s_waitcnt
-  lgkmcnt(0)`` is required to ensure synchronization between LDS operations and
-  vector memory operations between wavefronts of a work-group, but not between
-  operations performed by the same wavefront.
-* The vector memory operations are performed as wavefront wide operations.
-  Completion of load/store/sample operations are reported to a wavefront in
-  execution order of other load/store/sample operations performed by that
-  wavefront.
-* The vector memory operations access a vector L0 cache. There is a single L0
-  cache per CU. Each SIMD of a CU accesses the same L0 cache. Therefore, no
-  special action is required for coherence between the lanes of a single
-  wavefront. However, a ``buffer_gl0_inv`` is required for coherence between
-  wavefronts executing in the same work-group as they may be executing on SIMDs
-  of different CUs that access different L0s. A ``buffer_gl0_inv`` is also
-  required for coherence between wavefronts executing in different work-groups
-  as they may be executing on different WGPs.
-* The scalar memory operations access a scalar L0 cache shared by all wavefronts
-  on a WGP. The scalar and vector L0 caches are not coherent. However, scalar
-  operations are used in a restricted way so do not impact the memory model. See
-  :ref:`amdgpu-amdhsa-memory-spaces`.
-* The vector and scalar memory L0 caches use an L1 cache shared by all WGPs on
-  the same SA. Therefore, no special action is required for coherence between
-  the wavefronts of a single work-group. However, a ``buffer_gl1_inv`` is
-  required for coherence between wavefronts executing in different work-groups
-  as they may be executing on different SAs that access different L1s.
-* The L1 caches have independent quadrants to service disjoint ranges of virtual
-  addresses.
-* Each L0 cache has a separate request queue per L1 quadrant. Therefore, the
-  vector and scalar memory operations performed by different wavefronts, whether
-  executing in the same or different work-groups (which may be executing on
-  different CUs accessing different L0s), can be reordered relative to each
-  other. A ``s_waitcnt vmcnt(0) & vscnt(0)`` is required to ensure
-  synchronization between vector memory operations of different wavefronts. It
-  ensures a previous vector memory operation has completed before executing a
-  subsequent vector memory or LDS operation and so can be used to meet the
-  requirements of acquire, release and sequential consistency.
-* The L1 caches use an L2 cache shared by all SAs on the same agent.
-* The L2 cache has independent channels to service disjoint ranges of virtual
-  addresses.
-* Each L1 quadrant of a single SA accesses a different L2 channel. Each L1
-  quadrant has a separate request queue per L2 channel. Therefore, the vector
-  and scalar memory operations performed by wavefronts executing in different
-  work-groups (which may be executing on different SAs) of an agent can be
-  reordered relative to each other. A ``s_waitcnt vmcnt(0) & vscnt(0)`` is
-  required to ensure synchronization between vector memory operations of
-  different SAs. It ensures a previous vector memory operation has completed
-  before executing a subsequent vector memory and so can be used to meet the
-  requirements of acquire, release and sequential consistency.
-* The L2 cache can be kept coherent with other agents on some targets, or ranges
-  of virtual addresses can be set up to bypass it to ensure system coherence.
-
-Private address space uses ``buffer_load/store`` using the scratch V#
-(GFX6-GFX8), or ``scratch_load/store`` (GFX9-GFX10). Since only a single thread
-is accessing the memory, atomic memory orderings are not meaningful, and all
-accesses are treated as non-atomic.
-
-Constant address space uses ``buffer/global_load`` instructions (or equivalent
-scalar memory instructions). Since the constant address space contents do not
-change during the execution of a kernel dispatch it is not legal to perform
-stores, and atomic memory orderings are not meaningful, and all access are
-treated as non-atomic.
-
-A memory synchronization scope wider than work-group is not meaningful for the
-group (LDS) address space and is treated as work-group.
-
-The memory model does not support the region address space which is treated as
-non-atomic.
-
-Acquire memory ordering is not meaningful on store atomic instructions and is
-treated as non-atomic.
-
-Release memory ordering is not meaningful on load atomic instructions and is
-treated a non-atomic.
-
-Acquire-release memory ordering is not meaningful on load or store atomic
-instructions and is treated as acquire and release respectively.
-
-AMDGPU backend only uses scalar memory operations to access memory that is
-proven to not change during the execution of the kernel dispatch. This includes
-constant address space and global address space for program scope const
-variables. Therefore, the kernel machine code does not have to maintain the
-scalar L1 cache to ensure it is coherent with the vector L1 cache. The scalar
-and vector L1 caches are invalidated between kernel dispatches by CP since
-constant address space data may change between kernel dispatch executions. See
+Scalar memory operations are only ued to access memory that is proven to not
+change during the execution of the kernel dispatch. This includes constant
----------------
used
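
Side note for anyone cross-checking the restructured text against the old GFX10
bullets above: the cache/queue rules they describe boil down to sequences like
the following. This is only my rough sketch of an agent-scope acquire load (the
IR names are invented), not text from the patch:

    ; Agent-scope acquire load from global memory on GFX10. Per the rules
    ; above, the lowering is roughly:
    ;   global_load_dword ...   ; the load itself
    ;   s_waitcnt vmcnt(0)      ; wait for the load to complete
    ;   buffer_gl1_inv          ; L1 is shared per SA, so invalidate it for
    ;                           ; coherence with other work-groups
    ;   buffer_gl0_inv          ; L0 is per CU, so invalidate it for coherence
    ;                           ; with other wavefronts of the work-group
    %v = load atomic i32, i32 addrspace(1)* %p syncscope("agent") acquire, align 4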


================
Comment at: llvm/docs/AMDGPUUsage.rst:4309
+change during the execution of the kernel dispatch. This includes constant
+address space and global address space for program scope const variables.
+Therefore, the kernel machine code does not have to maintain the scalar cache to
----------------
Should this be either "constant" or `const`? (Not sure how to include literal backticks in Phabricator markup.)


================
Comment at: llvm/docs/AMDGPUUsage.rst:4384
+     atomicrmw    unordered    *any*          *any*      *Same as monotonic
+                                                         atomic*.
      **Monotonic Atomic**
----------------
Can avoid wrapping this


================
Comment at: llvm/docs/AMDGPUUsage.rst:4837-4846
+                                                           - Must happen before
+                                                             any following store
+                                                             atomic/atomicrmw
+                                                             with an equal or
+                                                             wider sync scope
+                                                             and memory ordering
+                                                             stronger than
----------------
It can be something we do down the line, but I think making the table wider would make it easier to digest. My screen is wider than it is tall, so I would gladly sacrifice some horizontal real-estate in order to see each row in its entirety without scrolling. This would also make it easier to read some of these longer bullets, and it would compress the number of lines in the file each table takes up. For example, wrapping at 120 columns this table shrinks from ~1200 source lines to ~600.


================
Comment at: llvm/docs/AMDGPUUsage.rst:5560
+
+Scalar memory operations are only ued to access memory that is proven to not
+change during the execution of the kernel dispatch. This includes constant
----------------
used


================
Comment at: llvm/docs/AMDGPUUsage.rst:5562
+change during the execution of the kernel dispatch. This includes constant
+address space and global address space for program scope const variables.
+Therefore, the kernel machine code does not have to maintain the scalar cache to
----------------
Same question as above


================
Comment at: llvm/docs/AMDGPUUsage.rst:5769
+                                                             data read is no
+                                                             older than a local load
+                                                             atomic value being
----------------
Is the change from "the" to "a" significant?


================
Comment at: llvm/docs/AMDGPUUsage.rst:5900
+                                                             data read is no
+                                                             older than a local
+                                                             atomicrmw value
----------------
Same question here


================
Comment at: llvm/docs/AMDGPUUsage.rst:6300
+                                                             memory operations
+                                                             to memory have
+                                                             completed before
----------------
Redundant, delete


================
Comment at: llvm/docs/AMDGPUUsage.rst:6395
+                                                         2. ds_atomic
+     atomicrmw    release      - agent        - global   1. s_waitcnt lkkmcnt(0) &
+                               - system       - generic      vmcnt(0) & vscnt(0)
----------------
lgkmcnt
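
For context, a minimal LLVM IR sketch of the operation this row covers (the
value name is mine, not from the patch):

    ; Agent-scope release atomicrmw to global memory. Per the row above, the
    ; backend emits s_waitcnt lgkmcnt(0) & vmcnt(0) & vscnt(0) before the atomic.
    %old = atomicrmw add i32 addrspace(1)* %p, i32 1 syncscope("agent") release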


Repository:
  rG LLVM Github Monorepo

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D90548/new/

https://reviews.llvm.org/D90548



More information about the llvm-commits mailing list