[all-commits] [llvm/llvm-project] 9975df: [libc] Small performance improvements to GPU alloc...
Joseph Huber via All-commits
all-commits at lists.llvm.org
Mon Jul 28 07:23:55 PDT 2025
Branch: refs/heads/main
Home: https://github.com/llvm/llvm-project
Commit: 9975dfdf800d9881b704a988bc004ec81639fe67
https://github.com/llvm/llvm-project/commit/9975dfdf800d9881b704a988bc004ec81639fe67
Author: Joseph Huber <huberjn at outlook.com>
Date: 2025-07-28 (Mon, 28 Jul 2025)
Changed paths:
M libc/src/__support/GPU/allocator.cpp
M libc/test/integration/src/stdlib/gpu/malloc_stress.cpp
Log Message:
-----------
[libc] Small performance improvements to GPU allocator
Summary:
This slightly increases performance in a few places. First, we
optimistically assume the cached slab has ample space, which lets us
skip the atomic load on the highly contended counter in the common case
where the attempt is likely to succeed. Second, we no longer call
`match_any` twice, since we can compute the uniform slabs at the moment
we return them. Third, we always choose a random starting index on a
32-bit boundary. This means that in the fast case we fulfill the
allocation with a single `fetch_or`, and otherwise we quickly move to a
free bit. This nets around a 7.75% improvement for the fast-path case.
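To make the third point concrete, here is a minimal CUDA-style sketch of
why a 32-bit-aligned starting index lets a group of lanes claim bits with
a single atomic OR on one word of the bitfield. This is not the upstream
allocator code; the function and variable names are invented for
illustration.

  // With the start aligned to a 32-bit word, every participating lane maps
  // into the same word, so the lowest active lane can claim all requested
  // bits with one atomic OR.
  __device__ bool try_claim_word(unsigned *bitfield, unsigned word_index) {
    unsigned lane = threadIdx.x % 32;
    unsigned active = __activemask();
    unsigned wanted = __ballot_sync(active, true); // one bit per active lane
    unsigned leader = __ffs(active) - 1;           // lowest active lane
    unsigned before = 0;
    if (lane == leader)
      before = atomicOr(&bitfield[word_index], wanted);
    before = __shfl_sync(active, before, leader);
    // A lane's bit was free if it was clear beforehand; otherwise the lane
    // falls back to the slower search for a free bit.
    return (before & (1u << lane)) == 0;
  }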
Commit: a7649007ef269c397b5d474d1b5f4432da96d1de
https://github.com/llvm/llvm-project/commit/a7649007ef269c397b5d474d1b5f4432da96d1de
Author: Joseph Huber <huberjn at outlook.com>
Date: 2025-07-28 (Mon, 28 Jul 2025)
Changed paths:
M libc/src/__support/GPU/allocator.cpp
M libc/test/integration/src/stdlib/gpu/malloc_stress.cpp
Log Message:
-----------
[libc] Rework match any use in hot allocate bitfield loop
Summary:
We previously used `match_any` as a shortcut to figure out which
threads were destined for which slots. This lowers to a for-loop which,
even if it often executes only once, still causes some slowdown,
especially when the lanes are divergent. Instead, we use a single ballot
call and compute the assignments directly.
Here the ballot tells us which lanes are the first in a block, either
the starting index or the boundary of a new 32-bit integer. We then use
some bit manipulation to find, for each lane ID, its closest leader. For
the length, we simply use the value calculated by that leader for the
remaining bits to be written. This removes the `match_any` and the
shuffle, which improves the minimum number of cycles this takes by about
5%.
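For reference, a minimal sketch of the ballot-plus-bit-manipulation trick
described above. The helper and its names are illustrative, not the
actual allocator code, and it assumes the first active lane is always a
leader.

  // Each lane finds the closest "leader" at or below its own lane ID from a
  // single ballot, and derives the block length from the next leader above,
  // with no match_any or shuffle required.
  __device__ void closest_leader(bool is_leader, unsigned &leader,
                                 unsigned &length) {
    unsigned lane = threadIdx.x % 32;
    unsigned ballot = __ballot_sync(__activemask(), is_leader);
    unsigned le_mask = ~0u >> (31 - lane);  // bits at or below this lane
    leader = 31 - __clz(ballot & le_mask);  // highest leader not above us
    unsigned above = ballot & ~le_mask;     // leaders strictly above us
    unsigned next = above ? __ffs(above) - 1 : 32;
    length = next - leader;                 // bits owned by our block
  }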
Commit: a1a610a1285fe4cde9f5f6a4a759da95266bdcb6
https://github.com/llvm/llvm-project/commit/a1a610a1285fe4cde9f5f6a4a759da95266bdcb6
Author: Joseph Huber <huberjn at outlook.com>
Date: 2025-07-28 (Mon, 28 Jul 2025)
Changed paths:
M libc/src/__support/GPU/allocator.cpp
Log Message:
-----------
[libc] Increase the number of times we wait on a slab
Summary:
This limit restricts how long we wait on a slab; the only reason this
isn't an infinite loop is to prevent complete deadlocks. However, the
previous limit was *just* on the cusp of waiting long enough for the
allocation to be done. This simply increases it to a sufficiently large
value, because the limit only exists to keep the interface wait-free in
the absolute worst-case scheduling scenario. This *MASSIVELY* improved
performance for mixed allocations, as we no longer shuffle around
creating more slabs than necessary.
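As an illustration of the pattern being tuned, the bounded wait looks
roughly like the sketch below. The constant and names here are
hypothetical, not the actual values or code in allocator.cpp.

  // The loop is bounded only so the interface stays wait-free under a
  // pathological scheduler; with a large enough bound the slab is ready
  // long before we give up.
  constexpr unsigned MAX_TRIES = 1024;  // assumed value, not the real limit
  __device__ bool wait_for_slab(unsigned *ready) {
    for (unsigned i = 0; i < MAX_TRIES; ++i) {
      if (atomicAdd(ready, 0u))  // atomic read of the "slab ready" flag
        return true;
      __nanosleep(64);           // brief backoff before retrying (sm_70+)
    }
    return false;  // caller falls back instead of deadlocking
  }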
Compare: https://github.com/llvm/llvm-project/compare/166493d69270...a1a610a1285f