[all-commits] [llvm/llvm-project] 07d47c: [AMDGPU] Update code sequence for CU-mode Release ...

Tue Oct 21 00:24:08 PDT 2025

  Branch: refs/heads/main
  Home:   https://github.com/llvm/llvm-project
  Commit: 07d47c792b980746ab1ff5ea3f346c87b024bd51
      https://github.com/llvm/llvm-project/commit/07d47c792b980746ab1ff5ea3f346c87b024bd51
  Author: Pierre van Houtryve <pierre.vanhoutryve at amd.com>
  Date:   2025-10-21 (Tue, 21 Oct 2025)

  Changed paths:
    M llvm/docs/AMDGPUUsage.rst
    M llvm/lib/Target/AMDGPU/SIMemoryLegalizer.cpp
    M llvm/test/CodeGen/AMDGPU/GlobalISel/memory-legalizer-atomic-fence.ll
    M llvm/test/CodeGen/AMDGPU/lds-dma-workgroup-release.ll
    M llvm/test/CodeGen/AMDGPU/memory-legalizer-barriers.ll
    M llvm/test/CodeGen/AMDGPU/memory-legalizer-fence-mmra-global.ll
    M llvm/test/CodeGen/AMDGPU/memory-legalizer-fence.ll
    M llvm/test/CodeGen/AMDGPU/memory-legalizer-flat-volatile.ll
    M llvm/test/CodeGen/AMDGPU/memory-legalizer-flat-workgroup.ll
    M llvm/test/CodeGen/AMDGPU/memory-legalizer-global-volatile.ll
    M llvm/test/CodeGen/AMDGPU/memory-legalizer-global-workgroup.ll
    M llvm/test/CodeGen/AMDGPU/memory-legalizer-local-agent.ll
    M llvm/test/CodeGen/AMDGPU/memory-legalizer-local-cluster.ll
    M llvm/test/CodeGen/AMDGPU/memory-legalizer-local-system.ll
    M llvm/test/CodeGen/AMDGPU/memory-legalizer-local-volatile.ll
    M llvm/test/CodeGen/AMDGPU/memory-legalizer-local-workgroup.ll

  Log Message:
  -----------
  [AMDGPU] Update code sequence for CU-mode Release Fences in GFX10+ (#161638)

They were previously optimized to not emit any waitcnt, which is
technically correct because there is no reordering of operations at
workgroup scope in CU mode for GFX10+.

This breaks transitivity however, for example if we have the following
sequence of events in one thread:

- some stores
- store atomic release syncscope("workgroup")
- barrier

then another thread follows with

- barrier
- load atomic acquire
- store atomic release syncscope("agent")

It does not work because, while the other thread sees the stores, it
cannot release them at the wider scope. Our release fences aren't strong
enough to "wait" on stores from other waves.

We also cannot strengthen our release fences any further to allow for
releasing other wave's stores because only GFX12 can do that with
`global_wb`. GFX10-11 do not have the writeback instruction.
It'd also add yet another level of complexity to code sequences, with
both acquire/release having CU-mode only alternatives.
Lastly, acq/rel are always used together. The price for synchronization
has to be paid either at the acq, or the rel. Strengthening the releases
would just make the memory model more complex but wouldn't help
performance.

So the choice here is to streamline the code sequences by making CU and
WGP mode emit almost identical (vL0 inv is not needed in CU mode) code
for release (or stronger) atomic ordering.

This also removes the `vm_vsrc(0)` wait before barriers. Now that the
release fence in CU mode is strong enough, it is no longer needed.

Supersedes #160501
Solves SC1-6454

To unsubscribe from these emails, change your notification settings at https://github.com/llvm/llvm-project/settings/notifications