[libc-commits] [PATCH] D155511: [libc] Treat the locks array as a bitfield
Joseph Huber via Phabricator via libc-commits
libc-commits at lists.llvm.org
Mon Jul 17 13:08:13 PDT 2023
jhuber6 created this revision.
jhuber6 added reviewers: jdoerfert, JonChesterfield, sivachandra.
Herald added projects: libc-project, All.
Herald added a subscriber: libc-commits.
jhuber6 requested review of this revision.
Currently we keep an internal buffer of device memory whose entries are used
to indicate ownership of a port. Since each entry is only ever used as a
single bit, we can simply turn this into a bitfield. I did this manually
rather than introducing a separate type, as the masks used to interact with
the locks need special handling.
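For illustration, here is a minimal standalone sketch of the same packing
scheme. It uses std::atomic instead of cpp::Atomic, and the names (PORT_COUNT,
WORD_BITS, try_lock_bit, unlock_bit) are placeholders rather than the
identifiers in rpc.h; the real try_lock additionally handles the warp-level
lane_mask and ballot logic shown in the diff below.

  #include <atomic>
  #include <cstdint>

  constexpr uint64_t PORT_COUNT = 512;
  constexpr uint32_t WORD_BITS = sizeof(uint32_t) * 8;

  // One lock bit per port, packed into 32-bit words.
  std::atomic<uint32_t> locks[PORT_COUNT / WORD_BITS] = {};

  // Try to claim the lock bit for 'index'; true if it was previously clear.
  bool try_lock_bit(uint64_t index) {
    uint32_t mask = 1u << (index % WORD_BITS);
    uint32_t before =
        locks[index / WORD_BITS].fetch_or(mask, std::memory_order_relaxed);
    return (before & mask) == 0;
  }

  // Release the lock bit for 'index' by clearing it.
  void unlock_bit(uint64_t index) {
    uint32_t mask = 1u << (index % WORD_BITS);
    locks[index / WORD_BITS].fetch_and(~mask, std::memory_order_relaxed);
  }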
Repository:
rG LLVM Github Monorepo
https://reviews.llvm.org/D155511
Files:
libc/src/__support/RPC/rpc.h
libc/utils/gpu/server/rpc_server.h
Index: libc/utils/gpu/server/rpc_server.h
===================================================================
--- libc/utils/gpu/server/rpc_server.h
+++ libc/utils/gpu/server/rpc_server.h
@@ -18,7 +18,7 @@
#endif
/// The maximum number of ports that can be opened for any server.
-const uint64_t RPC_MAXIMUM_PORT_COUNT = 64;
+const uint64_t RPC_MAXIMUM_PORT_COUNT = 512;
/// The symbol name associated with the client for use with the LLVM C library
/// implementation.
Index: libc/src/__support/RPC/rpc.h
===================================================================
--- libc/src/__support/RPC/rpc.h
+++ libc/src/__support/RPC/rpc.h
@@ -60,7 +60,7 @@
// of thumb is that you should have at least as many ports as possible
// concurrent work items on the GPU to mitigate the lack of forward
// progress guarantees on the GPU.
-constexpr uint64_t DEFAULT_PORT_COUNT = 64;
+constexpr uint64_t DEFAULT_PORT_COUNT = 512;
/// A common process used to synchronize communication between a client and a
/// server. The process contains a read-only inbox and a write-only outbox used
@@ -87,7 +87,9 @@
cpp::Atomic<uint32_t> *outbox;
Packet *packet;
- cpp::Atomic<uint32_t> lock[DEFAULT_PORT_COUNT] = {0};
+ // Each lock is a single bit packed into an array of 32-bit words.
+ static constexpr uint32_t BITFIELD_SIZE = sizeof(uint32_t) * 8;
+ cpp::Atomic<uint32_t> lock[DEFAULT_PORT_COUNT / BITFIELD_SIZE] = {0};
/// Initialize the communication channels.
LIBC_INLINE void reset(uint64_t port_count, void *buffer) {
@@ -158,10 +160,9 @@
/// Attempt to claim the lock at index. Return true on lock taken.
/// lane_mask is a bitmap of the threads in the warp that would hold the
- /// single lock on success, e.g. the result of gpu::get_lane_mask()
- /// The lock is held when the zeroth bit of the uint32_t at lock[index]
- /// is set, and available when that bit is clear. Bits [1, 32) are zero.
- /// Or with one is a no-op when the lock is already held.
+ /// single lock on success, e.g. the result of gpu::get_lane_mask().
+ /// The lock is bit (index % BITFIELD_SIZE) of lock[index / BITFIELD_SIZE];
+ /// it is held when that bit is set and available when it is clear.
[[clang::convergent]] LIBC_INLINE bool try_lock(uint64_t lane_mask,
uint64_t index) {
// On amdgpu, test and set to lock[index] and a sync_lane would suffice
@@ -173,12 +174,14 @@
// succeed in taking the lock, as otherwise it will leak. This is handled
// by making threads which are not in lane_mask or with 0, a no-op.
uint32_t id = gpu::get_lane_id();
- bool id_in_lane_mask = lane_mask & (1ul << id);
+ bool id_in_lane_mask = lane_mask & (uint64_t(1) << id);
// All threads in the warp call fetch_or. Possibly at the same time.
- bool before =
- lock[index].fetch_or(id_in_lane_mask, cpp::MemoryOrder::RELAXED);
- uint64_t packed = gpu::ballot(lane_mask, before);
+ uint32_t or_mask = id_in_lane_mask << (index % BITFIELD_SIZE);
+ uint32_t before = lock[index / BITFIELD_SIZE].fetch_or(
+ or_mask, cpp::MemoryOrder::RELAXED);
+ bool active = before & (1u << (index % BITFIELD_SIZE));
+ uint64_t packed = gpu::ballot(lane_mask, active);
// If every bit set in lane_mask is also set in packed, every single thread
// in the warp failed to get the lock. Ballot returns unset for threads not
@@ -216,8 +219,9 @@
// Must restrict to a single thread to avoid one thread dropping the lock,
// then an unrelated warp claiming the lock, then a second thread in this
// warp dropping the lock again.
- uint32_t and_mask = ~(rpc::is_first_lane(lane_mask) ? 1 : 0);
- lock[index].fetch_and(and_mask, cpp::MemoryOrder::RELAXED);
+ uint32_t and_mask =
+ -1u ^ (rpc::is_first_lane(lane_mask) << (index % BITFIELD_SIZE));
+ lock[index / BITFIELD_SIZE].fetch_and(and_mask, cpp::MemoryOrder::RELAXED);
gpu::sync_lane(lane_mask);
}
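As a concrete example of the indexing above (assuming BITFIELD_SIZE is 32 and
the lock array is sized as DEFAULT_PORT_COUNT / BITFIELD_SIZE): port index 70
maps to word 70 / 32 = 2 and bit 70 % 32 = 6, so try_lock sets bit 6 of
lock[2] with fetch_or and unlock clears it with fetch_and. With
DEFAULT_PORT_COUNT = 512 this packs all lock bits into 16 uint32_t words
instead of one uint32_t per port.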