[llvm] [AMDGPU] Only emit SCOPE_SYS global_wb (PR #110636)
via llvm-commits
llvm-commits at lists.llvm.org
Wed Oct 2 04:32:36 PDT 2024
llvmbot wrote:
@llvm/pr-subscribers-backend-amdgpu
Author: Pierre van Houtryve (Pierre-vh)
<details>
<summary>Changes</summary>
global_wb with scopes lower than SCOPE_SYS is unnecessary for correctness.
I was initially optimistic that they would be very cheap no-ops, but they can actually be quite expensive, so let's avoid them.
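
The rule described above can be sketched in a few lines of C++. This is only an illustration, not the code in SIMemoryLegalizer.cpp: the `Scope` enum and the `needsGlobalWB` and `releasePrefix` helpers are hypothetical names, and the sketch models only the global/generic release sequences from the GFX12 table, ignoring the conditions under which individual waits may be omitted.

```cpp
#include <string>
#include <vector>

// Hypothetical scope enum for illustration only; not the LLVM type.
enum class Scope { SingleThread, Wavefront, Workgroup, Agent, System };

// Per the PR description, global_wb is only required for correctness
// at system scope, so lower scopes skip it.
bool needsGlobalWB(Scope S) { return S == Scope::System; }

// Returns the instructions that precede a releasing store/atomic,
// as plain strings. Assumption: singlethread and wavefront scopes
// need no extra instructions, matching the table rows in the diff.
std::vector<std::string> releasePrefix(Scope S) {
  std::vector<std::string> Seq;
  if (S == Scope::SingleThread || S == Scope::Wavefront)
    return Seq;

  if (needsGlobalWB(S))
    Seq.push_back("global_wb scope:SCOPE_SYS");

  // The waits are kept at every scope from workgroup upwards.
  Seq.push_back("s_wait_bvhcnt 0x0");
  Seq.push_back("s_wait_samplecnt 0x0");
  Seq.push_back("s_wait_storecnt 0x0");
  Seq.push_back("s_wait_loadcnt 0x0");
  Seq.push_back("s_wait_dscnt 0x0");
  return Seq;
}
```

Under these assumptions, `releasePrefix(Scope::Agent)` yields only the five waits, while `releasePrefix(Scope::System)` starts with the writeback followed by the same waits.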
---
Patch is 687.25 KiB, truncated to 20.00 KiB below, full version: https://github.com/llvm/llvm-project/pull/110636.diff
38 Files Affected:
- (modified) llvm/docs/AMDGPUUsage.rst (+126-208)
- (modified) llvm/lib/Target/AMDGPU/SIMemoryLegalizer.cpp (+7-29)
- (modified) llvm/test/CodeGen/AMDGPU/GlobalISel/atomicrmw_fmax.ll (+3-18)
- (modified) llvm/test/CodeGen/AMDGPU/GlobalISel/atomicrmw_fmin.ll (+3-18)
- (modified) llvm/test/CodeGen/AMDGPU/GlobalISel/mubuf-global.ll (-10)
- (modified) llvm/test/CodeGen/AMDGPU/atomic_optimizations_global_pointer.ll (-32)
- (modified) llvm/test/CodeGen/AMDGPU/atomicrmw-expand.ll (-3)
- (modified) llvm/test/CodeGen/AMDGPU/buffer-fat-pointer-atomicrmw-fadd.ll (+29-55)
- (modified) llvm/test/CodeGen/AMDGPU/buffer-fat-pointer-atomicrmw-fmax.ll (+46-58)
- (modified) llvm/test/CodeGen/AMDGPU/buffer-fat-pointer-atomicrmw-fmin.ll (+46-58)
- (modified) llvm/test/CodeGen/AMDGPU/flat-atomicrmw-fadd.ll (-66)
- (modified) llvm/test/CodeGen/AMDGPU/flat-atomicrmw-fmax.ll (-50)
- (modified) llvm/test/CodeGen/AMDGPU/flat-atomicrmw-fmin.ll (-50)
- (modified) llvm/test/CodeGen/AMDGPU/flat-atomicrmw-fsub.ll (-46)
- (modified) llvm/test/CodeGen/AMDGPU/flat_atomics_i64.ll (-107)
- (modified) llvm/test/CodeGen/AMDGPU/fp-atomics-gfx940.ll (-4)
- (modified) llvm/test/CodeGen/AMDGPU/global-atomicrmw-fadd.ll (-80)
- (modified) llvm/test/CodeGen/AMDGPU/global-atomicrmw-fmax.ll (-50)
- (modified) llvm/test/CodeGen/AMDGPU/global-atomicrmw-fmin.ll (-50)
- (modified) llvm/test/CodeGen/AMDGPU/global-atomicrmw-fsub.ll (-46)
- (modified) llvm/test/CodeGen/AMDGPU/global_atomics_i64.ll (-97)
- (modified) llvm/test/CodeGen/AMDGPU/insert_waitcnt_for_precise_memory.ll (-3)
- (modified) llvm/test/CodeGen/AMDGPU/local-atomicrmw-fadd.ll (-33)
- (modified) llvm/test/CodeGen/AMDGPU/local-atomicrmw-fmax.ll (-30)
- (modified) llvm/test/CodeGen/AMDGPU/local-atomicrmw-fmin.ll (-30)
- (modified) llvm/test/CodeGen/AMDGPU/local-atomicrmw-fsub.ll (-30)
- (modified) llvm/test/CodeGen/AMDGPU/memory-legalizer-fence-mmra-global.ll (-18)
- (modified) llvm/test/CodeGen/AMDGPU/memory-legalizer-fence.ll (-18)
- (modified) llvm/test/CodeGen/AMDGPU/memory-legalizer-flat-agent.ll (-116)
- (modified) llvm/test/CodeGen/AMDGPU/memory-legalizer-flat-volatile.ll (-1)
- (modified) llvm/test/CodeGen/AMDGPU/memory-legalizer-flat-workgroup.ll (-54)
- (modified) llvm/test/CodeGen/AMDGPU/memory-legalizer-global-agent.ll (-114)
- (modified) llvm/test/CodeGen/AMDGPU/memory-legalizer-global-volatile.ll (-1)
- (modified) llvm/test/CodeGen/AMDGPU/memory-legalizer-global-workgroup.ll (-58)
- (modified) llvm/test/CodeGen/AMDGPU/memory-legalizer-local-agent.ll (-29)
- (modified) llvm/test/CodeGen/AMDGPU/memory-legalizer-local-system.ll (-29)
- (modified) llvm/test/CodeGen/AMDGPU/memory-legalizer-local-volatile.ll (-1)
- (modified) llvm/test/CodeGen/AMDGPU/memory-legalizer-local-workgroup.ll (-29)
``````````diff
diff --git a/llvm/docs/AMDGPUUsage.rst b/llvm/docs/AMDGPUUsage.rst
index 9e11b13c101d47..bfac4738732631 100644
--- a/llvm/docs/AMDGPUUsage.rst
+++ b/llvm/docs/AMDGPUUsage.rst
@@ -14182,8 +14182,13 @@ For GFX12:
* ``global_inv`` invalidates caches whose scope is strictly smaller than the
instruction's. The invalidation requests cannot be reordered with pending or
upcoming memory operations.
-* ``global_wb`` additionally ensures that previous memory operation done at
- a lower scope level have reached the ``SCOPE:`` of the ``global_wb``.
+* ``global_wb`` is a writeback operation that additionally ensures previous
+ memory operation done at a lower scope level have reached the ``SCOPE:``
+ of the ``global_wb``.
+
+ * ``global_wb`` can be omitted for scopes other than ``SCOPE_SYS`` in
+ gfx120x.
+
* The vector memory operations access a vector L0 cache. There is a single L0
cache per CU. Each SIMD of a CU accesses the same L0 cache. Therefore, no
special action is required for coherence between the lanes of a single
@@ -14890,19 +14895,7 @@ the instruction in the code sequence that references the table.
store atomic release - singlethread - global 1. buffer/global/ds/flat_store
- wavefront - local
- generic
- store atomic release - workgroup - global 1. ``global_wb scope:SCOPE_SE``
-
- - If CU wavefront execution
- mode, omit.
- - In combination with the waits
- below, ensures that all
- memory operations
- have completed at workgroup
- scope before performing the
- store that is being
- released.
-
- 2. | ``s_wait_bvhcnt 0x0``
+ store atomic release - workgroup - global 1. | ``s_wait_bvhcnt 0x0``
| ``s_wait_samplecnt 0x0``
| ``s_wait_storecnt 0x0``
| ``s_wait_loadcnt 0x0``
@@ -14925,7 +14918,11 @@ the instruction in the code sequence that references the table.
atomicrmw-with-return-value.
- ``s_wait_storecnt 0x0``
must happen after
- ``global_wb``.
+ any preceding
+ global/generic
+ store/store
+ atomic/
+ atomicrmw-no-return-value.
- ``s_wait_dscnt 0x0``
must happen after
any preceding
@@ -14945,19 +14942,7 @@ the instruction in the code sequence that references the table.
- Apply :ref:`amdgpu-amdhsa-memory-model-code-sequences-gfx12-scopes-table`.
- store atomic release - workgroup - local 1. ``global_wb scope:SCOPE_SE``
-
- - If CU wavefront execution
- mode or OpenCL, omit.
- - In combination with the waits
- below, ensures that all
- memory operations
- have completed at workgroup
- scope before performing the
- store that is being
- released.
-
- 2. | ``s_wait_bvhcnt 0x0``
+ store atomic release - workgroup - local 1. | ``s_wait_bvhcnt 0x0``
| ``s_wait_samplecnt 0x0``
| ``s_wait_storecnt 0x0``
| ``s_wait_loadcnt 0x0``
@@ -14980,7 +14965,11 @@ the instruction in the code sequence that references the table.
atomicrmw-with-return-value.
- ``s_wait_storecnt 0x0``
must happen after
- ``global_wb``.
+ any preceding
+ global/generic
+ store/store
+ atomic/
+ atomicrmw-no-return-value.
- Must happen before the
following store.
- Ensures that all
@@ -14992,16 +14981,9 @@ the instruction in the code sequence that references the table.
released.
3. ds_store
- store atomic release - agent - global 1. ``global_wb``
+ store atomic release - agent - global 1. ``global_wb scope:SCOPE_SYS``
- system - generic
- - Apply :ref:`amdgpu-amdhsa-memory-model-code-sequences-gfx12-scopes-table`.
- - In combination with the waits
- below, ensures that all
- memory operations
- have completed at agent or system
- scope before performing the
- store that is being
- released.
+ - If agent scope, omit.
2. | ``s_wait_bvhcnt 0x0``
| ``s_wait_samplecnt 0x0``
@@ -15025,7 +15007,12 @@ the instruction in the code sequence that references the table.
atomicrmw-with-return-value.
- ``s_wait_storecnt 0x0``
must happen after
- ``global_wb``.
+ ``global_wb`` if present, or
+ any preceding
+ global/generic
+ store/store
+ atomic/
+ atomicrmw-no-return-value.
- ``s_wait_dscnt 0x0``
must happen after
any preceding
@@ -15050,20 +15037,8 @@ the instruction in the code sequence that references the table.
atomicrmw release - singlethread - global 1. buffer/global/ds/flat_atomic
- wavefront - local
- generic
- atomicrmw release - workgroup - global 1. ``global_wb scope:SCOPE_SE``
- - generic
- - If CU wavefront execution
- mode, omit.
- - In combination with the waits
- below, ensures that all
- memory operations
- have completed at workgroup
- scope before performing the
- store that is being
- released.
-
- 2. | ``s_wait_bvhcnt 0x0``
- | ``s_wait_samplecnt 0x0``
+ atomicrmw release - workgroup - global 1. | ``s_wait_bvhcnt 0x0``
+ - generic | ``s_wait_samplecnt 0x0``
| ``s_wait_storecnt 0x0``
| ``s_wait_loadcnt 0x0``
| ``s_wait_dscnt 0x0``
@@ -15086,15 +15061,19 @@ the instruction in the code sequence that references the table.
atomic/
atomicrmw-with-return-value.
- ``s_wait_storecnt 0x0``
- must happen after
- ``global_wb``.
+ must happen after
+ any preceding
+ global/generic
+ store/store
+ atomic/
+ atomicrmw-no-return-value.
- ``s_wait_dscnt 0x0``
- must happen after
- any preceding
- local/generic
- load/store/load
- atomic/store
- atomic/atomicrmw.
+ must happen after
+ any preceding
+ local/generic
+ load/store/load
+ atomic/store
+ atomic/atomicrmw.
- Must happen before the
following atomic.
- Ensures that all
@@ -15105,23 +15084,11 @@ the instruction in the code sequence that references the table.
atomicrmw that is
being released.
- 3. buffer/global/flat_atomic
+ 2. buffer/global/flat_atomic
- Apply :ref:`amdgpu-amdhsa-memory-model-code-sequences-gfx12-scopes-table`.
- atomicrmw release - workgroup - local 1. ``global_wb scope:SCOPE_SE``
-
- - If CU wavefront execution
- mode or OpenCL, omit.
- - In combination with the waits
- below, ensures that all
- memory operations
- have completed at workgroup
- scope before performing the
- store that is being
- released.
-
- 2. | ``s_wait_bvhcnt 0x0``
+ atomicrmw release - workgroup - local 1. | ``s_wait_bvhcnt 0x0``
| ``s_wait_samplecnt 0x0``
| ``s_wait_storecnt 0x0``
| ``s_wait_loadcnt 0x0``
@@ -15144,7 +15111,11 @@ the instruction in the code sequence that references the table.
atomicrmw-with-return-value.
- ``s_wait_storecnt 0x0``
must happen after
- ``global_wb``.
+ any preceding
+ global/generic
+ store/store
+ atomic/
+ atomicrmw-no-return-value.
- Must happen before the
following atomic.
- Ensures that all
@@ -15155,17 +15126,10 @@ the instruction in the code sequence that references the table.
store that is being
released.
- 3. ds_atomic
- atomicrmw release - agent - global 1. ``global_wb scope:``
+ 2. ds_atomic
+ atomicrmw release - agent - global 1. ``global_wb scope:SCOPE_SYS``
- system - generic
- - Apply :ref:`amdgpu-amdhsa-memory-model-code-sequences-gfx12-scopes-table`.
- - In combination with the waits
- below, ensures that all
- memory operations
- have completed at agent or system
- scope before performing the
- store that is being
- released.
+ - If agent scope, omit.
2. | ``s_wait_bvhcnt 0x0``
| ``s_wait_samplecnt 0x0``
@@ -15188,7 +15152,12 @@ the instruction in the code sequence that references the table.
atomicrmw-with-return-value.
- ``s_wait_storecnt 0x0``
must happen after
- ``global_wb``
+ ``global_wb`` if present, or
+ any preceding
+ global/generic
+ store/store
+ atomic/
+ atomicrmw-no-return-value.
- ``s_wait_dscnt 0x0``
must happen after
any preceding
@@ -15212,19 +15181,7 @@ the instruction in the code sequence that references the table.
fence release - singlethread *none* *none*
- wavefront
- fence release - workgroup *none* 1. ``global_wb scope:SCOPE_SE``
-
- - If CU wavefront execution
- mode, omit.
- - In combination with the waits
- below, ensures that all
- memory operations
- have completed at workgroup
- scope before performing the
- store that is being
- released.
-
- 2. | ``s_wait_bvhcnt 0x0``
+ fence release - workgroup *none* 1. | ``s_wait_bvhcnt 0x0``
| ``s_wait_samplecnt 0x0``
| ``s_wait_storecnt 0x0``
| ``s_wait_loadcnt 0x0``
@@ -15254,7 +15211,11 @@ the instruction in the code sequence that references the table.
atomicrmw-with-return-value.
- ``s_wait_storecnt 0x0``
must happen after
- ``global_wb``
+ ...
[truncated]
``````````
</details>
https://github.com/llvm/llvm-project/pull/110636