[libcxx-commits] [libcxx] [libc++] Optimize bitset::to_string (PR #128832)

Peng Liu via libcxx-commits libcxx-commits at lists.llvm.org
Tue Feb 25 22:42:22 PST 2025


https://github.com/winner245 created https://github.com/llvm/llvm-project/pull/128832

This patch optimizes `bitset::to_string` by replacing the conventional bit-by-bit iteration with a more efficient bit traversal strategy. Instead of checking each bit sequentially, we leverage `std::__countr_zero` to efficiently locate the next set bit, skipping over consecutive zero bits. This greatly accelerates the conversion process, especially for sparse `bitset`s where zero bits dominate. To ensure similar improvements for dense `bitset`s, we exploit symmetry by inverting the bit pattern, allowing us to apply the same optimized traversal technique. Even for uniformly distributed `bitset`s, the proposed approach offers measurable performance gains over the existing implementation.
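
As a rough illustration of the traversal described above, here is a minimal single-word sketch. It is not the libc++ code: `word_to_string` and `lowest_set_index` are hypothetical names, the word is assumed to be 64 bits wide, and the public C++20 `std::countr_zero` stands in for the internal `std::__countr_zero` (with a GCC/Clang builtin as an assumed fallback before C++20).

```cpp
#include <bit>
#include <cstddef>
#include <cstdint>
#include <string>

// Hypothetical helper: index of the lowest set bit of a nonzero word.
inline int lowest_set_index(std::uint64_t w) {
#if defined(__cpp_lib_bitops)
  return std::countr_zero(w);
#else
  return __builtin_ctzll(w); // assumption: GCC/Clang builtin, pre-C++20 only
#endif
}

// Hypothetical helper: renders the low `nbits` bits of `word`
// (0 <= nbits <= 64), most significant bit first.
std::string word_to_string(std::uint64_t word, std::size_t nbits) {
  std::string out(nbits, '0');
  if (nbits < 64)
    word &= (std::uint64_t{1} << nbits) - 1; // ignore bits above nbits
  // `word &= word - 1` clears the lowest set bit, so the loop body runs
  // once per 1 bit and never visits the zero bits in between.
  for (; word != 0; word &= word - 1) {
    std::size_t pos = static_cast<std::size_t>(lowest_set_index(word));
    out[nbits - 1 - pos] = '1';
  }
  return out;
}
```

Each iteration handles exactly one set bit, so the work scales with the number of 1 bits rather than with the width; for example, `word_to_string(0b1010, 8)` yields `"00001010"`.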

Benchmarks demonstrate substantial improvements, achieving up to **6.6x** speedup for sparse `bitset`s, **10.4x** for dense `bitset`s, and **2.0x** for uniformly distributed `bitset`s.
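
The dense case relies on the symmetry mentioned in the description: inverting each word turns the rare 0 bits into set bits, so the same skip-ahead loop can be reused with the fill and mark characters swapped. The following is a hedged multi-word sketch of that idea, not the patch itself. `bits_to_string` and `lowest_set_index` are hypothetical names, words are assumed to be 64 bits, and unused high bits of the last word are assumed to be zero on input.

```cpp
#include <bit>
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical helper: index of the lowest set bit of a nonzero word.
inline int lowest_set_index(std::uint64_t w) {
#if defined(__cpp_lib_bitops)
  return std::countr_zero(w);
#else
  return __builtin_ctzll(w); // assumption: GCC/Clang builtin, pre-C++20 only
#endif
}

// Precondition (assumed): words.size() == (nbits + 63) / 64 and the
// padding bits of the last word are zero.
std::string bits_to_string(const std::vector<std::uint64_t>& words, std::size_t nbits) {
  // Count the 1 bits to decide which symbol is the minority.
  std::size_t ones = 0;
  for (std::uint64_t w : words)
    for (; w != 0; w &= w - 1)
      ++ones;

  const bool sparse = ones < nbits / 2;
  const char fill   = sparse ? '0' : '1'; // majority symbol, written in bulk
  const char mark   = sparse ? '1' : '0'; // minority symbol, written per bit
  std::string out(nbits, fill);

  for (std::size_t i = 0, base = 0; i < words.size(); ++i, base += 64) {
    // In the dense case the inverted word has a set bit wherever the
    // original had a 0.
    std::uint64_t w = sparse ? words[i] : ~words[i];
    // Inversion also sets the padding bits of a partial last word,
    // so they must be masked off before traversal.
    if (nbits - base < 64)
      w &= (std::uint64_t{1} << (nbits - base)) - 1;
    for (; w != 0; w &= w - 1)
      out[nbits - 1 - (base + static_cast<std::size_t>(lowest_set_index(w)))] = mark;
  }
  return out;
}
```

The masking step matters only on the inverted path when the padding bits start out zero, which is why forgetting it would go unnoticed in sparse inputs but corrupt dense ones.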

 


##### Sparse case (10% 1 bits)

```
--------------------------------------------------------------------------------------
Benchmark                                            Before         After  Improvement
--------------------------------------------------------------------------------------
BM_BitsetToString<32>/Sparse(10 %)/10               18.7 ns       17.7 ns        1.1x
BM_BitsetToString<64>/Sparse(10 %)/10               40.5 ns       16.0 ns        2.5x
BM_BitsetToString<128>/Sparse(10 %)/10              69.8 ns       22.3 ns        3.1x
BM_BitsetToString<256>/Sparse(10 %)/10               129 ns       29.1 ns        4.4x
BM_BitsetToString<512>/Sparse(10 %)/10               277 ns       45.9 ns        6.0x
BM_BitsetToString<1024>/Sparse(10 %)/10              535 ns        112 ns        4.8x
BM_BitsetToString<2048>/Sparse(10 %)/10             1004 ns        175 ns        5.7x
BM_BitsetToString<4096>/Sparse(10 %)/10             1967 ns        418 ns        4.7x
BM_BitsetToString<8192>/Sparse(10 %)/10             4064 ns        618 ns        6.6x
BM_BitsetToString<16384>/Sparse(10 %)/10            8280 ns       1503 ns        5.5x
BM_BitsetToString<32768>/Sparse(10 %)/10           15476 ns       2409 ns        6.4x
BM_BitsetToString<65536>/Sparse(10 %)/10           31873 ns       6486 ns        4.9x
BM_BitsetToString<131072>/Sparse(10 %)/10          64303 ns      10186 ns        6.3x
BM_BitsetToString<262144>/Sparse(10 %)/10         134330 ns      25555 ns        5.3x
BM_BitsetToString<524288>/Sparse(10 %)/10         253769 ns      41379 ns        6.1x
BM_BitsetToString<1048576>/Sparse(10 %)/10        517276 ns     103079 ns        5.0x
```

##### Dense case (90% 1 bits)

```
--------------------------------------------------------------------------------------
Benchmark                                            Before         After  Improvement
--------------------------------------------------------------------------------------
BM_BitsetToString<32>/Dense(90 %)/90                25.1 ns       15.9 ns        1.6x
BM_BitsetToString<64>/Dense(90 %)/90                45.8 ns       19.2 ns        2.4x
BM_BitsetToString<128>/Dense(90 %)/90               96.6 ns       22.6 ns        4.3x
BM_BitsetToString<256>/Dense(90 %)/90                187 ns       31.8 ns        5.9x
BM_BitsetToString<512>/Dense(90 %)/90                374 ns       45.3 ns        8.3x
BM_BitsetToString<1024>/Dense(90 %)/90               750 ns       89.4 ns        8.4x
BM_BitsetToString<2048>/Dense(90 %)/90              1292 ns        190 ns        6.8x
BM_BitsetToString<4096>/Dense(90 %)/90              2557 ns        371 ns        6.9x
BM_BitsetToString<8192>/Dense(90 %)/90              5721 ns        666 ns        8.6x
BM_BitsetToString<16384>/Dense(90 %)/90            11480 ns       1225 ns        9.4x
BM_BitsetToString<32768>/Dense(90 %)/90            19835 ns       2557 ns        7.8x
BM_BitsetToString<65536>/Dense(90 %)/90            46761 ns       5040 ns        9.3x
BM_BitsetToString<131072>/Dense(90 %)/90           91796 ns      10822 ns        8.5x
BM_BitsetToString<262144>/Dense(90 %)/90          185850 ns      21172 ns        8.8x
BM_BitsetToString<524288>/Dense(90 %)/90          328253 ns      43810 ns        7.5x
BM_BitsetToString<1048576>/Dense(90 %)/90         898541 ns      86344 ns       10.4x
```


##### Uniform case (50% 1 bits)

```
--------------------------------------------------------------------------------------
Benchmark                                            Before         After  Improvement
--------------------------------------------------------------------------------------
BM_BitsetToString<32>/Uniform(50 %)/50              23.7 ns       21.5 ns        1.1x
BM_BitsetToString<64>/Uniform(50 %)/50              55.9 ns       40.7 ns        1.4x
BM_BitsetToString<128>/Uniform(50 %)/50             87.0 ns       48.7 ns        1.8x
BM_BitsetToString<256>/Uniform(50 %)/50              156 ns        120 ns        1.3x
BM_BitsetToString<512>/Uniform(50 %)/50              296 ns        151 ns        2.0x
BM_BitsetToString<1024>/Uniform(50 %)/50             569 ns        421 ns        1.4x
BM_BitsetToString<2048>/Uniform(50 %)/50            1142 ns        903 ns        1.3x
BM_BitsetToString<4096>/Uniform(50 %)/50            2211 ns       1378 ns        1.6x
BM_BitsetToString<8192>/Uniform(50 %)/50            4430 ns       3619 ns        1.2x
BM_BitsetToString<16384>/Uniform(50 %)/50           8871 ns       5894 ns        1.5x
BM_BitsetToString<32768>/Uniform(50 %)/50          17505 ns      13420 ns        1.3x
BM_BitsetToString<65536>/Uniform(50 %)/50          35055 ns      24498 ns        1.4x
BM_BitsetToString<131072>/Uniform(50 %)/50         70637 ns      56697 ns        1.2x
BM_BitsetToString<262144>/Uniform(50 %)/50        141838 ns      89614 ns        1.6x
BM_BitsetToString<524288>/Uniform(50 %)/50        284197 ns     220883 ns        1.3x
BM_BitsetToString<1048576>/Uniform(50 %)/50       569476 ns     359686 ns        1.6x
```


From e6977a5b1c5a58a5e03d16b1d3048f5fd070a9f8 Mon Sep 17 00:00:00 2001
From: Peng Liu <winner245 at hotmail.com>
Date: Mon, 24 Feb 2025 22:37:04 -0500
Subject: [PATCH] Optimize bitset::to_string

---
 libcxx/include/bitset                   |  58 ++++++++++--
 libcxx/test/benchmarks/bitset.bench.cpp | 113 ++++++++++++++++++++++++
 2 files changed, 165 insertions(+), 6 deletions(-)
 create mode 100644 libcxx/test/benchmarks/bitset.bench.cpp

diff --git a/libcxx/include/bitset b/libcxx/include/bitset
index ab1dda739c7d5..33aebeef48908 100644
--- a/libcxx/include/bitset
+++ b/libcxx/include/bitset
@@ -136,6 +136,8 @@ template <size_t N> struct hash<std::bitset<N>>;
 #  include <__algorithm/fill_n.h>
 #  include <__algorithm/find.h>
 #  include <__assert>
+#  include <__bit/countr.h>
+#  include <__bit/invert_if.h>
 #  include <__bit_reference>
 #  include <__config>
 #  include <__cstddef/ptrdiff_t.h>
@@ -223,6 +225,10 @@ protected:
     return to_ullong(integral_constant < bool, _Size< sizeof(unsigned long long) * CHAR_BIT>());
   }
 
+  template <bool _Sparse, class _CharT, class _Traits, class _Allocator>
+  _LIBCPP_HIDE_FROM_ABI _LIBCPP_CONSTEXPR_SINCE_CXX23 basic_string<_CharT, _Traits, _Allocator>
+  __to_string(_CharT __zero, _CharT __one) const;
+
   _LIBCPP_HIDE_FROM_ABI _LIBCPP_CONSTEXPR_SINCE_CXX23 bool all() const _NOEXCEPT;
   _LIBCPP_HIDE_FROM_ABI _LIBCPP_CONSTEXPR_SINCE_CXX23 bool any() const _NOEXCEPT;
   _LIBCPP_HIDE_FROM_ABI size_t __hash_code() const _NOEXCEPT;
@@ -389,6 +395,22 @@ __bitset<_N_words, _Size>::to_ullong(true_type, true_type) const {
   return __r;
 }
 
+template <size_t _N_words, size_t _Size>
+template <bool _Sparse, class _CharT, class _Traits, class _Allocator>
+_LIBCPP_HIDE_FROM_ABI _LIBCPP_CONSTEXPR_SINCE_CXX23 basic_string<_CharT, _Traits, _Allocator>
+__bitset<_N_words, _Size>::__to_string(_CharT __zero, _CharT __one) const {
+  basic_string<_CharT, _Traits, _Allocator> __r(_Size, __zero);
+  for (size_t __i = 0, __bits = 0; __i < _N_words; ++__i, __bits += __bits_per_word) {
+    __storage_type __word = std::__invert_if<!_Sparse>(__first_[__i]);
+    if (__i == _N_words - 1 && _Size - __bits < __bits_per_word)
+      __word &= (__storage_type(1) << (_Size - __bits)) - 1;
+    for (; __word; __word &= (__word - 1))
+      __r[_Size - 1 - (__bits + std::__countr_zero(__word))] = __one;
+  }
+
+  return __r;
+}
+
 template <size_t _N_words, size_t _Size>
 _LIBCPP_HIDE_FROM_ABI _LIBCPP_CONSTEXPR_SINCE_CXX23 bool __bitset<_N_words, _Size>::all() const _NOEXCEPT {
   // do middle whole words
@@ -480,6 +502,10 @@ protected:
   _LIBCPP_HIDE_FROM_ABI _LIBCPP_CONSTEXPR_SINCE_CXX23 unsigned long to_ulong() const;
   _LIBCPP_HIDE_FROM_ABI _LIBCPP_CONSTEXPR_SINCE_CXX23 unsigned long long to_ullong() const;
 
+  template <bool _Sparse, class _CharT, class _Traits, class _Allocator>
+  _LIBCPP_HIDE_FROM_ABI _LIBCPP_CONSTEXPR_SINCE_CXX23 basic_string<_CharT, _Traits, _Allocator>
+  __to_string(_CharT __zero, _CharT __one) const;
+
   _LIBCPP_HIDE_FROM_ABI _LIBCPP_CONSTEXPR_SINCE_CXX23 bool all() const _NOEXCEPT;
   _LIBCPP_HIDE_FROM_ABI _LIBCPP_CONSTEXPR_SINCE_CXX23 bool any() const _NOEXCEPT;
 
@@ -529,6 +555,21 @@ inline _LIBCPP_HIDE_FROM_ABI _LIBCPP_CONSTEXPR_SINCE_CXX23 unsigned long long __
   return __first_;
 }
 
+template <size_t _Size>
+template <bool _Sparse, class _CharT, class _Traits, class _Allocator>
+_LIBCPP_HIDE_FROM_ABI _LIBCPP_CONSTEXPR_SINCE_CXX23 basic_string<_CharT, _Traits, _Allocator>
+__bitset<1, _Size>::__to_string(_CharT __zero, _CharT __one) const {
+  basic_string<_CharT, _Traits, _Allocator> __r(_Size, __zero);
+  __storage_type __word = std::__invert_if<!_Sparse>(__first_);
+  if (_Size < __bits_per_word)
+    __word &= (__storage_type(1) << _Size) - 1;
+  for (; __word; __word &= (__word - 1)) {
+    size_t __pos           = std::__countr_zero(__word);
+    __r[_Size - 1 - __pos] = __one;
+  }
+  return __r;
+}
+
 template <size_t _Size>
 inline _LIBCPP_HIDE_FROM_ABI _LIBCPP_CONSTEXPR_SINCE_CXX23 bool __bitset<1, _Size>::all() const _NOEXCEPT {
   __storage_type __m = ~__storage_type(0) >> (__bits_per_word - _Size);
@@ -593,6 +634,12 @@ protected:
   _LIBCPP_HIDE_FROM_ABI _LIBCPP_CONSTEXPR_SINCE_CXX23 unsigned long to_ulong() const { return 0; }
   _LIBCPP_HIDE_FROM_ABI _LIBCPP_CONSTEXPR_SINCE_CXX23 unsigned long long to_ullong() const { return 0; }
 
+  template <bool _Sparse, class _CharT, class _Traits, class _Allocator>
+  _LIBCPP_HIDE_FROM_ABI _LIBCPP_CONSTEXPR_SINCE_CXX23 basic_string<_CharT, _Traits, _Allocator>
+  __to_string(_CharT, _CharT) const {
+    return basic_string<_CharT, _Traits, _Allocator>();
+  }
+
   _LIBCPP_HIDE_FROM_ABI _LIBCPP_CONSTEXPR_SINCE_CXX23 bool all() const _NOEXCEPT { return true; }
   _LIBCPP_HIDE_FROM_ABI _LIBCPP_CONSTEXPR_SINCE_CXX23 bool any() const _NOEXCEPT { return false; }
 
@@ -848,12 +895,11 @@ template <size_t _Size>
 template <class _CharT, class _Traits, class _Allocator>
 _LIBCPP_HIDE_FROM_ABI _LIBCPP_CONSTEXPR_SINCE_CXX23 basic_string<_CharT, _Traits, _Allocator>
 bitset<_Size>::to_string(_CharT __zero, _CharT __one) const {
-  basic_string<_CharT, _Traits, _Allocator> __r(_Size, __zero);
-  for (size_t __i = 0; __i != _Size; ++__i) {
-    if ((*this)[__i])
-      __r[_Size - 1 - __i] = __one;
-  }
-  return __r;
+  bool __sparse = size_t(std::count(__base::__make_iter(0), __base::__make_iter(_Size), true)) < _Size / 2;
+  if (__sparse)
+    return __base::template __to_string<true, _CharT, _Traits, _Allocator>(__zero, __one);
+  else
+    return __base::template __to_string<false, _CharT, _Traits, _Allocator>(__one, __zero);
 }
 
 template <size_t _Size>
diff --git a/libcxx/test/benchmarks/bitset.bench.cpp b/libcxx/test/benchmarks/bitset.bench.cpp
new file mode 100644
index 0000000000000..0dfdc7d430d14
--- /dev/null
+++ b/libcxx/test/benchmarks/bitset.bench.cpp
@@ -0,0 +1,113 @@
+//===----------------------------------------------------------------------===//
+//
+// Part of the LLVM Project, under the Apache License v2.0 with LLVM Exceptions.
+// See https://llvm.org/LICENSE.txt for license information.
+// SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception
+//
+//===----------------------------------------------------------------------===//
+
+#include <benchmark/benchmark.h>
+#include <bitset>
+#include <cmath>
+#include <cstddef>
+
+template <std::size_t N>
+struct GenerateBitset {
+  // Construct a bitset with p*N true bits
+  static std::bitset<N> generate(double p) {
+    std::bitset<N> b;
+    if (p <= 0.0)
+      return b;
+    if (p >= 1.0)
+      return ~b;
+
+    std::size_t num_ones = std::round(N * p);
+    if (num_ones == 0)
+      return b;
+
+    double step  = static_cast<double>(N) / num_ones;
+    double error = 0.0;
+
+    std::size_t pos = 0;
+    for (std::size_t i = 0; i < num_ones; ++i) {
+      if (pos >= N)
+        break;
+      b.set(pos);
+      error += step;
+      pos += std::floor(error);
+      error -= std::floor(error);
+    }
+    return b;
+  }
+
+  static std::bitset<N> sparse() { return generate(0.1); }
+  static std::bitset<N> dense() { return generate(0.9); }
+  static std::bitset<N> uniform() { return generate(0.5); }
+};
+
+template <std::size_t N>
+static void BM_BitsetToString(benchmark::State& state) {
+  double p         = state.range(0) / 100.0;
+  std::bitset<N> b = GenerateBitset<N>::generate(p);
+  benchmark::DoNotOptimize(b);
+
+  for (auto _ : state) {
+    benchmark::DoNotOptimize(b.to_string());
+  }
+}
+
+// Sparse bitset
+BENCHMARK_CAPTURE(BM_BitsetToString<32>, Sparse(10 %))->Arg(10);
+BENCHMARK_CAPTURE(BM_BitsetToString<64>, Sparse(10 %))->Arg(10);
+BENCHMARK_CAPTURE(BM_BitsetToString<128>, Sparse(10 %))->Arg(10);
+BENCHMARK_CAPTURE(BM_BitsetToString<256>, Sparse(10 %))->Arg(10);
+BENCHMARK_CAPTURE(BM_BitsetToString<512>, Sparse(10 %))->Arg(10);
+BENCHMARK_CAPTURE(BM_BitsetToString<1024>, Sparse(10 %))->Arg(10);
+BENCHMARK_CAPTURE(BM_BitsetToString<2048>, Sparse(10 %))->Arg(10);
+BENCHMARK_CAPTURE(BM_BitsetToString<4096>, Sparse(10 %))->Arg(10);
+BENCHMARK_CAPTURE(BM_BitsetToString<8192>, Sparse(10 %))->Arg(10);
+BENCHMARK_CAPTURE(BM_BitsetToString<16384>, Sparse(10 %))->Arg(10);
+BENCHMARK_CAPTURE(BM_BitsetToString<32768>, Sparse(10 %))->Arg(10);
+BENCHMARK_CAPTURE(BM_BitsetToString<65536>, Sparse(10 %))->Arg(10);
+BENCHMARK_CAPTURE(BM_BitsetToString<131072>, Sparse(10 %))->Arg(10);
+BENCHMARK_CAPTURE(BM_BitsetToString<262144>, Sparse(10 %))->Arg(10);
+BENCHMARK_CAPTURE(BM_BitsetToString<524288>, Sparse(10 %))->Arg(10);
+BENCHMARK_CAPTURE(BM_BitsetToString<1048576>, Sparse(10 %))->Arg(10); // 1 << 20
+
+// Dense bitset
+BENCHMARK_CAPTURE(BM_BitsetToString<32>, Dense(90 %))->Arg(90);
+BENCHMARK_CAPTURE(BM_BitsetToString<64>, Dense(90 %))->Arg(90);
+BENCHMARK_CAPTURE(BM_BitsetToString<128>, Dense(90 %))->Arg(90);
+BENCHMARK_CAPTURE(BM_BitsetToString<256>, Dense(90 %))->Arg(90);
+BENCHMARK_CAPTURE(BM_BitsetToString<512>, Dense(90 %))->Arg(90);
+BENCHMARK_CAPTURE(BM_BitsetToString<1024>, Dense(90 %))->Arg(90);
+BENCHMARK_CAPTURE(BM_BitsetToString<2048>, Dense(90 %))->Arg(90);
+BENCHMARK_CAPTURE(BM_BitsetToString<4096>, Dense(90 %))->Arg(90);
+BENCHMARK_CAPTURE(BM_BitsetToString<8192>, Dense(90 %))->Arg(90);
+BENCHMARK_CAPTURE(BM_BitsetToString<16384>, Dense(90 %))->Arg(90);
+BENCHMARK_CAPTURE(BM_BitsetToString<32768>, Dense(90 %))->Arg(90);
+BENCHMARK_CAPTURE(BM_BitsetToString<65536>, Dense(90 %))->Arg(90);
+BENCHMARK_CAPTURE(BM_BitsetToString<131072>, Dense(90 %))->Arg(90);
+BENCHMARK_CAPTURE(BM_BitsetToString<262144>, Dense(90 %))->Arg(90);
+BENCHMARK_CAPTURE(BM_BitsetToString<524288>, Dense(90 %))->Arg(90);
+BENCHMARK_CAPTURE(BM_BitsetToString<1048576>, Dense(90 %))->Arg(90); // 1 << 20
+
+// Uniform bitset
+BENCHMARK_CAPTURE(BM_BitsetToString<32>, Uniform(50 %))->Arg(50);
+BENCHMARK_CAPTURE(BM_BitsetToString<64>, Uniform(50 %))->Arg(50);
+BENCHMARK_CAPTURE(BM_BitsetToString<128>, Uniform(50 %))->Arg(50);
+BENCHMARK_CAPTURE(BM_BitsetToString<256>, Uniform(50 %))->Arg(50);
+BENCHMARK_CAPTURE(BM_BitsetToString<512>, Uniform(50 %))->Arg(50);
+BENCHMARK_CAPTURE(BM_BitsetToString<1024>, Uniform(50 %))->Arg(50);
+BENCHMARK_CAPTURE(BM_BitsetToString<2048>, Uniform(50 %))->Arg(50);
+BENCHMARK_CAPTURE(BM_BitsetToString<4096>, Uniform(50 %))->Arg(50);
+BENCHMARK_CAPTURE(BM_BitsetToString<8192>, Uniform(50 %))->Arg(50);
+BENCHMARK_CAPTURE(BM_BitsetToString<16384>, Uniform(50 %))->Arg(50);
+BENCHMARK_CAPTURE(BM_BitsetToString<32768>, Uniform(50 %))->Arg(50);
+BENCHMARK_CAPTURE(BM_BitsetToString<65536>, Uniform(50 %))->Arg(50);
+BENCHMARK_CAPTURE(BM_BitsetToString<131072>, Uniform(50 %))->Arg(50);
+BENCHMARK_CAPTURE(BM_BitsetToString<262144>, Uniform(50 %))->Arg(50);
+BENCHMARK_CAPTURE(BM_BitsetToString<524288>, Uniform(50 %))->Arg(50);
+BENCHMARK_CAPTURE(BM_BitsetToString<1048576>, Uniform(50 %))->Arg(50); // 1 << 20
+
+BENCHMARK_MAIN();


