[libc-commits] [libc] [libc] Search empty bits after failed allocation (PR #149910)
Joseph Huber via libc-commits
libc-commits at lists.llvm.org
Mon Jul 21 14:49:15 PDT 2025
https://github.com/jhuber6 created https://github.com/llvm/llvm-project/pull/149910
Summary:
The scheme we use to find a free bit is a random walk. This works very
well until the bitfield becomes almost completely saturated. Because
fetch_or returns the previous value of the word, we can search that
value for known-empty bits and use one as our next guess. Since the
distribution is random, this effectively improves the likelihood of
finding a free bit on the second try by 32x: instead of another uniform
guess over the 32-bit word, we jump directly to a bit we observed to be
clear.
This *massively* improves performance when a lot of memory is allocated
without freeing, as filling the last bit no longer requires a
one-in-a-million shot. A future change could improve on this by only
*mostly* filling the slab, leaving 1% free at all times.
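Concretely, the retry logic reduces to something like the following
minimal sketch. The single std::atomic word and the claim_bit helper
are illustrative assumptions, not the allocator's actual interface; the
real code below additionally coordinates the guess across the GPU
wavefront with gpu::shuffle.

  #include <atomic>
  #include <bit>
  #include <cstdint>

  constexpr uint32_t BITS_IN_WORD = 32;

  // Try to claim one bit in 'word', starting from a random guess and
  // falling back to any empty bit observed on the previous attempt.
  // Returns the claimed bit index, or -1 if the word is saturated.
  int claim_bit(std::atomic<uint32_t> &word, uint32_t guess) {
    uint32_t after = ~0u; // No known-empty bits yet; first try uses the guess.
    for (;;) {
      // If the last fetch_or exposed empty bits, jump straight to one;
      // otherwise fall back to the random guess.
      uint32_t bit = ~after ? std::countr_zero(~after) : guess % BITS_IN_WORD;
      uint32_t mask = 1u << bit;
      uint32_t before = word.fetch_or(mask);
      if (!(before & mask))
        return bit;          // We flipped the bit from 0 to 1; it is ours.
      after = before | mask; // Remember the occupancy we just observed.
      if (after == ~0u)
        return -1;           // Every bit is taken; try another word.
    }
  }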
From 71974682b68678ca86ab67c21ea24894915592e2 Mon Sep 17 00:00:00 2001
From: Joseph Huber <huberjn at outlook.com>
Date: Mon, 21 Jul 2025 16:45:55 -0500
Subject: [PATCH] [libc] Search empty bits after failed allocation
Summary:
The scheme we use to find a free bit is a random walk. This works very
well until the bitfield becomes almost completely saturated. Because
fetch_or returns the previous value of the word, we can search that
value for known-empty bits and use one as our next guess. Since the
distribution is random, this effectively improves the likelihood of
finding a free bit on the second try by 32x: instead of another uniform
guess over the 32-bit word, we jump directly to a bit we observed to be
clear.
This *massively* improves performance when a lot of memory is allocated
without freeing, as filling the last bit no longer requires a
one-in-a-million shot. A future change could improve on this by only
*mostly* filling the slab, leaving 1% free at all times.
---
libc/src/__support/GPU/allocator.cpp | 16 +++++++++++++---
1 file changed, 13 insertions(+), 3 deletions(-)
diff --git a/libc/src/__support/GPU/allocator.cpp b/libc/src/__support/GPU/allocator.cpp
index 7923fbb2c1c24..a499c2d9b9e59 100644
--- a/libc/src/__support/GPU/allocator.cpp
+++ b/libc/src/__support/GPU/allocator.cpp
@@ -251,12 +251,18 @@ struct Slab {
// The uniform mask represents which lanes contain a uniform target pointer.
// We attempt to place these next to each other.
void *result = nullptr;
+ uint32_t after = ~0u;
+ uint32_t old_index = 0;
for (uint64_t mask = lane_mask; mask;
mask = gpu::ballot(lane_mask, !result)) {
if (result)
continue;
- uint32_t start = gpu::broadcast_value(lane_mask, impl::xorshift32(state));
+ // We try using any known empty bits from the previous attempt first.
+ uint32_t start = gpu::shuffle(mask, cpp::countr_zero(uniform & mask),
+ ~after ? (old_index & ~(BITS_IN_WORD - 1)) +
+ cpp::countr_zero(~after)
+ : impl::xorshift32(state));
uint32_t id = impl::lane_count(uniform & mask);
uint32_t index = (start + id) % usable_bits(chunk_size);
@@ -266,8 +272,9 @@ struct Slab {
// Get the mask of bits destined for the same slot and coalesce it.
uint64_t match = uniform & gpu::match_any(mask, slot);
uint32_t length = cpp::popcount(match);
- uint32_t bitmask = static_cast<uint32_t>((uint64_t(1) << length) - 1)
- << bit;
+ uint32_t bitmask = gpu::shuffle(
+ mask, cpp::countr_zero(match),
+ static_cast<uint32_t>((uint64_t(1) << length) - 1) << bit);
uint32_t before = 0;
if (gpu::get_lane_id() == static_cast<uint32_t>(cpp::countr_zero(match)))
@@ -278,6 +285,9 @@ struct Slab {
result = ptr_from_index(index, chunk_size);
else
sleep_briefly();
+
+ after = before | bitmask;
+ old_index = index;
}
cpp::atomic_thread_fence(cpp::MemoryOrder::ACQUIRE);