[libcxx-commits] [libcxx] f7c0df0 - [libc++][format] Improve format buffer.

Tue Aug 16 09:54:16 PDT 2022

Author: Mark de Wever
Date: 2022-08-16T18:54:10+02:00
New Revision: f7c0df002a083bcca8ac4972330b8198474a355b

URL: https://github.com/llvm/llvm-project/commit/f7c0df002a083bcca8ac4972330b8198474a355b
DIFF: https://github.com/llvm/llvm-project/commit/f7c0df002a083bcca8ac4972330b8198474a355b.diff

LOG: [libc++][format] Improve format buffer.

Allow bulk output operations on the buffer instead of adding one
code unit at a time. This has a huge performance benefit at the cost of
a larger binary. This doesn't implement @vitaut's earlier suggestion to
avoid buffering for std::string when writing strings. That can be done
in a follow-up patch.
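
As a rough illustration of the difference between the two approaches, here
is a minimal sketch of a fixed-size buffer that offers both a per-character
push_back and a chunked bulk copy. The names chunked_buffer, write_bulk and
write_per_char, the 8-element storage and the std::string sink are all
assumptions made for this example; this is not the libc++ __output_buffer.

```cpp
#include <algorithm>
#include <array>
#include <cassert>
#include <cstddef>
#include <string>
#include <string_view>

// Hypothetical stand-in for a buffered writer; illustration only.
class chunked_buffer {
public:
  explicit chunked_buffer(std::string& sink) : sink_(sink) {}

  // One code unit at a time: one capacity check per character.
  void push_back(char c) {
    storage_[size_++] = c;
    if (size_ == storage_.size())
      flush();
  }

  // Bulk copy: one capacity check per chunk instead of per character.
  void copy(std::string_view str) {
    const char* first = str.data();
    std::size_t n     = str.size();
    while (n != 0) {
      // There is always at least one free slot, since a full buffer is flushed.
      std::size_t chunk = std::min(n, storage_.size() - size_);
      std::copy_n(first, chunk, storage_.data() + size_);
      size_ += chunk;
      first += chunk;
      n -= chunk;
      if (size_ == storage_.size())
        flush();
    }
  }

  // Hands the buffered code units to the sink and empties the buffer.
  void flush() {
    sink_.append(storage_.data(), size_);
    size_ = 0;
  }

private:
  std::array<char, 8> storage_{};
  std::size_t size_ = 0;
  std::string& sink_;
};

// Writes text through the bulk path and returns what reached the sink.
inline std::string write_bulk(std::string_view text) {
  std::string out;
  chunked_buffer buf(out);
  buf.copy(text);
  buf.flush();
  return out;
}

// Writes text one code unit at a time and returns what reached the sink.
inline std::string write_per_char(std::string_view text) {
  std::string out;
  chunked_buffer buf(out);
  for (char c : text)
    buf.push_back(c);
  buf.flush();
  return out;
}
```

Both paths produce the same output; the bulk path only reduces the number
of bookkeeping steps per code unit.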

There are some minor complications for the non-buffered format_to_n.
When writing one character at a time it's easy to detect when the limit
n is reached; with bulk writes a single operation can cross the limit.
This is solved by adding a small overhead for format_to_n: when the next
write would overflow, it stores the data in the internal buffer and
copies up to n code units from it. The overhead isn't measured, but it's
expected to only be an issue for small values of n; for larger values
the general improvements will outweigh the new overhead.
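
The truncation rule can be sketched as follows: every chunk counts towards
the total size, but only the part that still fits within n reaches the
output. The type limited_sink and its members are hypothetical names for
this example and only loosely mirror the flush logic of
__format_to_n_buffer_base in the diff below.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <string>
#include <string_view>

// Hypothetical sink that accepts bulk writes but forwards at most
// max_size code units; illustration only.
struct limited_sink {
  explicit limited_sink(std::size_t n) : max_size(n) {}

  std::string out;       // code units that actually reach the destination
  std::size_t max_size;  // the limit "n" of format_to_n
  std::size_t total = 0; // code units the formatter tried to write

  void write(std::string_view chunk) {
    if (total < max_size) {
      // Only the part of the chunk that still fits is copied.
      std::size_t room = max_size - total;
      out.append(chunk.data(), std::min(chunk.size(), room));
    }
    // The total keeps growing so the caller can report the untruncated size.
    total += chunk.size();
  }
};
```

With a limit of 5, writing "abc" followed by "defg" leaves "abcde" in out
while total ends up at 7.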

```
   text	   data	    bss	    dec	    hex	filename
 349081	   6096	    440	 355617	  56d21	format.libcxx.out-baseline
 344442	   6088	    440	 350970	  55afa	formatted_size.libcxx.out-baseline
4567980	  57272	    424	4625676	 46950c	formatter_float.libcxx.out-baseline
 718800	  12472	    488	 731760	  b2a70	formatter_int.libcxx.out-baseline
 376341	   6096	    552	 382989	  5d80d	format_to.libcxx.out-beaseline

 370169	   6096	    440	 376705	  5bf81	format.libcxx.out
 365530	   6088	    440	 372058	  5ad5a	formatted_size.libcxx.out
4575116	  57272	    424	4632812	 46b0ec	formatter_float.libcxx.out
 725936	  12472	    488	 738896	  b4650	formatter_int.libcxx.out
 397429	   6096	    552	 404077	  62a6d	format_to.libcxx.out
```

For very small strings the new method is slower; from 4 characters
onward there's already a small gain.

```
Comparing ./format.libcxx.out-baseline to ./format.libcxx.out
Benchmark                                           Time             CPU      Time Old      Time New       CPU Old       CPU New
--------------------------------------------------------------------------------------------------------------------------------
BM_format_string<char>/1                         +0.0268         +0.0268            43            44            43            44
BM_format_string<char>/2                         +0.0133         +0.0133            22            22            22            22
BM_format_string<char>/4                         -0.0248         -0.0248            12            11            12            11
BM_format_string<char>/8                         -0.0831         -0.0831             6             6             6             6
BM_format_string<char>/16                        -0.2976         -0.2976             4             3             4             3
BM_format_string<char>/32                        -0.4369         -0.4369             3             2             3             2
BM_format_string<char>/64                        -0.6375         -0.6375             3             1             3             1
BM_format_string<char>/128                       -0.7685         -0.7685             2             1             2             1

```

The int benchmark has benefits for the simple formatting, but shines for
the complex formatting:
```
Comparing ./formatter_int.libcxx.out-baseline to ./formatter_int.libcxx.out
Benchmark                                                               Time             CPU      Time Old      Time New       CPU Old       CPU New
----------------------------------------------------------------------------------------------------------------------------------------------------
BM_Basic<uint32_t>                                                   -0.2307         -0.2307            60            46            60            46
BM_Basic<int32_t>                                                    -0.1985         -0.1985            61            49            61            49
BM_Basic<uint64_t>                                                   -0.3478         -0.3479            81            53            81            53
BM_Basic<int64_t>                                                    -0.3475         -0.3475            81            53            81            53
BM_BasicLow<__uint128_t>                                             -0.3388         -0.3388            86            57            86            57
BM_BasicLow<__int128_t>                                              -0.3431         -0.3431            86            57            86            57
BM_Basic<__uint128_t>                                                -0.2822         -0.2822           236           170           236           170
BM_Basic<__int128_t>                                                 -0.3107         -0.3107           219           151           219           151
Integral_LocFalse_BaseBin_AlignNone_Int64                            -0.5781         -0.5781           178            75           178            75
Integral_LocFalse_BaseBin_AlignmentLeft_Int64                        -0.9231         -0.9231          1156            89          1156            89
Integral_LocFalse_BaseBin_AlignmentCenter_Int64                      -0.9179         -0.9179          1107            91          1107            91
Integral_LocFalse_BaseBin_AlignmentRight_Int64                       -0.9238         -0.9238          1147            87          1147            87
Integral_LocFalse_BaseBin_ZeroPadding_Int64                          -0.9170         -0.9170          1137            94          1137            94
Integral_LocFalse_BaseBin_AlignNone_Uint64                           -0.5923         -0.5923           175            71           175            71
Integral_LocFalse_BaseBin_AlignmentLeft_Uint64                       -0.9251         -0.9251          1154            86          1154            86
Integral_LocFalse_BaseBin_AlignmentCenter_Uint64                     -0.9204         -0.9204          1105            88          1105            88
Integral_LocFalse_BaseBin_AlignmentRight_Uint64                      -0.9242         -0.9242          1125            85          1125            85
Integral_LocFalse_BaseBin_ZeroPadding_Uint64                         -0.9232         -0.9232          1139            88          1139            88
Integral_LocFalse_BaseOct_AlignNone_Int64                            -0.3241         -0.3241           100            67           100            67
Integral_LocFalse_BaseOct_AlignmentLeft_Int64                        -0.9322         -0.9322          1166            79          1166            79
Integral_LocFalse_BaseOct_AlignmentCenter_Int64                      -0.9251         -0.9251          1108            83          1108            83
Integral_LocFalse_BaseOct_AlignmentRight_Int64                       -0.9303         -0.9303          1136            79          1136            79
Integral_LocFalse_BaseOct_ZeroPadding_Int64                          -0.9264         -0.9264          1156            85          1156            85
Integral_LocFalse_BaseOct_AlignNone_Uint64                           -0.3116         -0.3116            96            66            96            66
Integral_LocFalse_BaseOct_AlignmentLeft_Uint64                       -0.9310         -0.9310          1168            81          1168            81
Integral_LocFalse_BaseOct_AlignmentCenter_Uint64                     -0.9281         -0.9281          1128            81          1128            81
Integral_LocFalse_BaseOct_AlignmentRight_Uint64                      -0.9299         -0.9299          1148            80          1148            80
Integral_LocFalse_BaseOct_ZeroPadding_Uint64                         -0.9288         -0.9288          1153            82          1153            82
Integral_LocFalse_BaseDec_AlignNone_Int64                            -0.3342         -0.3342            95            63            95            63
Integral_LocFalse_BaseDec_AlignmentLeft_Int64                        -0.9360         -0.9360          1157            74          1157            74
Integral_LocFalse_BaseDec_AlignmentCenter_Int64                      -0.9303         -0.9303          1128            79          1128            79
Integral_LocFalse_BaseDec_AlignmentRight_Int64                       -0.9369         -0.9369          1164            73          1164            73
Integral_LocFalse_BaseDec_ZeroPadding_Int64                          -0.9323         -0.9323          1157            78          1157            78
Integral_LocFalse_BaseDec_AlignNone_Uint64                           -0.3198         -0.3198            93            63            93            63
Integral_LocFalse_BaseDec_AlignmentLeft_Uint64                       -0.9351         -0.9351          1158            75          1158            75
Integral_LocFalse_BaseDec_AlignmentCenter_Uint64                     -0.9298         -0.9298          1128            79          1128            79
Integral_LocFalse_BaseDec_AlignmentRight_Uint64                      -0.9361         -0.9361          1157            74          1157            74
Integral_LocFalse_BaseDec_ZeroPadding_Uint64                         -0.9333         -0.9333          1151            77          1151            77
Integral_LocFalse_BaseHex_AlignNone_Int64                            -0.3020         -0.3020            89            62            89            62
Integral_LocFalse_BaseHex_AlignmentLeft_Int64                        -0.9357         -0.9357          1174            75          1174            75
Integral_LocFalse_BaseHex_AlignmentCenter_Int64                      -0.9319         -0.9319          1129            77          1129            77
Integral_LocFalse_BaseHex_AlignmentRight_Int64                       -0.9350         -0.9350          1161            75          1161            75
Integral_LocFalse_BaseHex_ZeroPadding_Int64                          -0.9293         -0.9293          1150            81          1150            81
Integral_LocFalse_BaseHex_AlignNone_Uint64                           -0.3056         -0.3057            86            59            86            59
Integral_LocFalse_BaseHex_AlignmentLeft_Uint64                       -0.9378         -0.9378          1174            73          1174            73
Integral_LocFalse_BaseHex_AlignmentCenter_Uint64                     -0.9341         -0.9341          1129            74          1130            74
Integral_LocFalse_BaseHex_AlignmentRight_Uint64                      -0.9361         -0.9361          1157            74          1157            74
Integral_LocFalse_BaseHex_ZeroPadding_Uint64                         -0.9315         -0.9315          1147            79          1147            79
Integral_LocFalse_BaseHexUpper_AlignNone_Int64                       -0.0019         -0.0019            91            90            91            90
Integral_LocFalse_BaseHexUpper_AlignmentLeft_Int64                   -0.9099         -0.9099          1162           105          1162           105
Integral_LocFalse_BaseHexUpper_AlignmentCenter_Int64                 -0.9041         -0.9041          1121           108          1121           108
Integral_LocFalse_BaseHexUpper_AlignmentRight_Int64                  -0.9086         -0.9086          1162           106          1162           106
Integral_LocFalse_BaseHexUpper_ZeroPadding_Int64                     -0.9057         -0.9057          1164           110          1164           110
Integral_LocFalse_BaseHexUpper_AlignNone_Uint64                      +0.0110         +0.0110            86            87            86            87
Integral_LocFalse_BaseHexUpper_AlignmentLeft_Uint64                  -0.9136         -0.9136          1161           100          1161           100
Integral_LocFalse_BaseHexUpper_AlignmentCenter_Uint64                -0.9078         -0.9078          1133           104          1133           104
Integral_LocFalse_BaseHexUpper_AlignmentRight_Uint64                 -0.9132         -0.9132          1177           102          1177           102
Integral_LocFalse_BaseHexUpper_ZeroPadding_Uint64                    -0.9091         -0.9091          1160           105          1160           105
```
Other benchmarks give similar results.

Reviewed By: #libc, ldionne

Differential Revision: https://reviews.llvm.org/D129964

Added: 
    

Modified: 
    libcxx/include/__format/buffer.h
    libcxx/include/__format/formatter_floating_point.h
    libcxx/include/__format/formatter_integral.h
    libcxx/include/__format/formatter_output.h
    libcxx/test/std/utilities/format/format.formatter/format.formatter.spec/formatter.unsigned_integral.pass.cpp
    libcxx/test/std/utilities/format/format.functions/format_tests.h

Removed: 
    


################################################################################
diff --git a/libcxx/include/__format/buffer.h b/libcxx/include/__format/buffer.h
index 1837ad06ba18e..4f972264e481f 100644
--- a/libcxx/include/__format/buffer.h
+++ b/libcxx/include/__format/buffer.h
@@ -11,8 +11,10 @@
 #define _LIBCPP___FORMAT_BUFFER_H
 
 #include <__algorithm/copy_n.h>
+#include <__algorithm/fill_n.h>
 #include <__algorithm/max.h>
 #include <__algorithm/min.h>
+#include <__algorithm/transform.h>
 #include <__algorithm/unwrap_iter.h>
 #include <__config>
 #include <__format/enable_insertable.h>
@@ -26,6 +28,7 @@
 #include <__utility/move.h>
 #include <concepts>
 #include <cstddef>
+#include <string_view>
 #include <type_traits>
 
 #if !defined(_LIBCPP_HAS_NO_PRAGMA_SYSTEM_HEADER)
@@ -69,8 +72,6 @@ class _LIBCPP_TEMPLATE_VIS __output_buffer {
     return back_insert_iterator{*this};
   }
 
-  // TODO FMT It would be nice to have an overload taking a
-  // basic_string_view<_CharT> and append it directly.
   _LIBCPP_HIDE_FROM_ABI void push_back(_CharT __c) {
     __ptr_[__size_++] = __c;
 
@@ -80,6 +81,95 @@ class _LIBCPP_TEMPLATE_VIS __output_buffer {
       flush();
   }
 
+  /// Copies the input __str to the buffer.
+  ///
+  /// Since some of the input is generated by std::to_chars, there needs to be a
+  /// conversion when _CharT is wchar_t.
+  template <__formatter::__char_type _InCharT>
+  _LIBCPP_HIDE_FROM_ABI void __copy(basic_string_view<_InCharT> __str) {
+    // When the underlying iterator is a simple iterator the __capacity_ is
+    // infinite. For a string or container back_inserter it isn't. This means
+    // adding a large string to the buffer can cause some overhead. In that
+    // case a better approach could be:
+    // - flush the buffer
+    // - container.append(__str.begin(), __str.end());
+    // The same holds true for the fill.
+    // For transform it might be slightly harder, however the use case for
+    // transform is slightly less common; it converts hexadecimal values to
+    // upper case. For integral these strings are short.
+    // TODO FMT Look at the improvements above.
+    size_t __n = __str.size();
+
+    __flush_on_overflow(__n);
+    if (__n <= __capacity_) {
+      _VSTD::copy_n(__str.data(), __n, _VSTD::addressof(__ptr_[__size_]));
+      __size_ += __n;
+      return;
+    }
+
+    // The output doesn't fit in the internal buffer.
+    // Copy the data in "__capacity_" sized chunks.
+    _LIBCPP_ASSERT(__size_ == 0, "the buffer should be flushed by __flush_on_overflow");
+    const _InCharT* __first = __str.data();
+    do {
+      size_t __chunk = _VSTD::min(__n, __capacity_);
+      _VSTD::copy_n(__first, __chunk, _VSTD::addressof(__ptr_[__size_]));
+      __size_ = __chunk;
+      __first += __chunk;
+      __n -= __chunk;
+      flush();
+    } while (__n);
+  }
+
+  /// A std::transform wrapper.
+  ///
+  /// Like @ref __copy it may need to do type conversion.
+  template <__formatter::__char_type _InCharT, class _UnaryOperation>
+  _LIBCPP_HIDE_FROM_ABI void __transform(const _InCharT* __first, const _InCharT* __last, _UnaryOperation __operation) {
+    _LIBCPP_ASSERT(__first <= __last, "not a valid range");
+
+    size_t __n = static_cast<size_t>(__last - __first);
+    __flush_on_overflow(__n);
+    if (__n <= __capacity_) {
+      _VSTD::transform(__first, __last, _VSTD::addressof(__ptr_[__size_]), _VSTD::move(__operation));
+      __size_ += __n;
+      return;
+    }
+
+    // The output doesn't fit in the internal buffer.
+    // Transform the data in "__capacity_" sized chunks.
+    _LIBCPP_ASSERT(__size_ == 0, "the buffer should be flushed by __flush_on_overflow");
+    do {
+      size_t __chunk = _VSTD::min(__n, __capacity_);
+      _VSTD::transform(__first, __first + __chunk, _VSTD::addressof(__ptr_[__size_]), __operation);
+      __size_ = __chunk;
+      __first += __chunk;
+      __n -= __chunk;
+      flush();
+    } while (__n);
+  }
+
+  /// A \c fill_n wrapper.
+  _LIBCPP_HIDE_FROM_ABI void __fill(size_t __n, _CharT __value) {
+    __flush_on_overflow(__n);
+    if (__n <= __capacity_) {
+      _VSTD::fill_n(_VSTD::addressof(__ptr_[__size_]), __n, __value);
+      __size_ += __n;
+      return;
+    }
+
+    // The output doesn't fit in the internal buffer.
+    // Fill the buffer in "__capacity_" sized chunks.
+    _LIBCPP_ASSERT(__size_ == 0, "the buffer should be flushed by __flush_on_overflow");
+    do {
+      size_t __chunk = _VSTD::min(__n, __capacity_);
+      _VSTD::fill_n(_VSTD::addressof(__ptr_[__size_]), __chunk, __value);
+      __size_ = __chunk;
+      __n -= __chunk;
+      flush();
+    } while (__n);
+  }
+
   _LIBCPP_HIDE_FROM_ABI void flush() {
     __flush_(__ptr_, __size_, __obj_);
     __size_ = 0;
@@ -91,6 +181,44 @@ class _LIBCPP_TEMPLATE_VIS __output_buffer {
   size_t __size_{0};
   void (*__flush_)(_CharT*, size_t, void*);
   void* __obj_;
+
+  /// Flushes the buffer when the output operation would overflow the buffer.
+  ///
+  /// A simple approach for the overflow detection would be something along the
+  /// lines:
+  /// \code
+  /// // The internal buffer is large enough.
+  /// if (__n <= __capacity_) {
+  ///   // Flush when we really would overflow.
+  ///   if (__size_ + __n >= __capacity_)
+  ///     flush();
+  ///   ...
+  /// }
+  /// \endcode
+  ///
+  /// This approach works for all cases but one:
+  /// A __format_to_n_buffer_base where \ref __enable_direct_output is true.
+  /// In that case the \ref __capacity_ of the buffer changes during the first
+  /// \ref flush. During that operation the output buffer switches from its
+  /// __writer_ to its __storage_. The \ref __capacity_ of the former depends
+  /// on the value of n, of the latter is a fixed size. For example:
+  /// - a format_to_n call with a 10'000 char buffer,
+  /// - the buffer is filled with 9'500 chars,
+  /// - adding 1'000 elements would overflow the buffer so the buffer gets
+  ///   changed and the \ref __capacity_ decreases from 10'000 to
+  ///   __buffer_size (256 at the time of writing).
+  ///
+  /// This means that the \ref flush for this class may need to copy a part of
+  /// the internal buffer to the proper output. In this example there will be
+  /// 500 characters that need this copy operation.
+  ///
+  /// Note it would be more efficient to write 500 chars directly and then swap
+  /// the buffers. This would make the code more complex and \ref format_to_n is
+  /// not the most common use case. Therefore the optimization isn't done.
+  _LIBCPP_HIDE_FROM_ABI void __flush_on_overflow(size_t __n) {
+    if (__size_ + __n >= __capacity_)
+      flush();
+  }
 };
 
 /// A storage using an internal buffer.
@@ -280,12 +408,12 @@ struct _LIBCPP_TEMPLATE_VIS __format_to_n_buffer_base {
   using _Size = iter_difference_t<_OutIt>;
 
 public:
-  _LIBCPP_HIDE_FROM_ABI explicit __format_to_n_buffer_base(_OutIt __out_it, _Size __n)
-      : __writer_(_VSTD::move(__out_it)), __n_(_VSTD::max(_Size(0), __n)) {}
+  _LIBCPP_HIDE_FROM_ABI explicit __format_to_n_buffer_base(_OutIt __out_it, _Size __max_size)
+      : __writer_(_VSTD::move(__out_it)), __max_size_(_VSTD::max(_Size(0), __max_size)) {}
 
   _LIBCPP_HIDE_FROM_ABI void flush(_CharT* __ptr, size_t __size) {
-    if (_Size(__size_) <= __n_)
-      __writer_.flush(__ptr, _VSTD::min(_Size(__size), __n_ - __size_));
+    if (_Size(__size_) <= __max_size_)
+      __writer_.flush(__ptr, _VSTD::min(_Size(__size), __max_size_ - __size_));
     __size_ += __size;
   }
 
@@ -294,7 +422,7 @@ struct _LIBCPP_TEMPLATE_VIS __format_to_n_buffer_base {
   __output_buffer<_CharT> __output_{__storage_.begin(), __storage_.__buffer_size, this};
   typename __writer_selector<_OutIt, _CharT>::type __writer_;
 
-  _Size __n_;
+  _Size __max_size_;
   _Size __size_{0};
 };
 
@@ -310,24 +438,35 @@ class _LIBCPP_TEMPLATE_VIS __format_to_n_buffer_base<_OutIt, _CharT, true> {
   using _Size = iter_difference_t<_OutIt>;
 
 public:
-  _LIBCPP_HIDE_FROM_ABI explicit __format_to_n_buffer_base(_OutIt __out_it, _Size __n)
-      : __output_(_VSTD::__unwrap_iter(__out_it), __n, this), __writer_(_VSTD::move(__out_it)) {
-    if (__n <= 0) [[unlikely]]
+  _LIBCPP_HIDE_FROM_ABI explicit __format_to_n_buffer_base(_OutIt __out_it, _Size __max_size)
+      : __output_(_VSTD::__unwrap_iter(__out_it), __max_size, this),
+        __writer_(_VSTD::move(__out_it)),
+        __max_size_(__max_size) {
+    if (__max_size <= 0) [[unlikely]]
       __output_.reset(__storage_.begin(), __storage_.__buffer_size);
   }
 
   _LIBCPP_HIDE_FROM_ABI void flush(_CharT* __ptr, size_t __size) {
-    // A flush to the direct writer happens in two occasions:
+    // A flush to the direct writer happens in the following occasions:
     // - The format function has written the maximum number of allowed code
     //   units. At this point it's no longer valid to write to this writer. So
     //   switch to the internal storage. This internal storage doesn't need to
     //   be written anywhere so the flush for that storage writes no output.
+    // - Like above, but the next "mass write" operation would overflow the
+    //   buffer. In that case the buffer is pre-emptively switched. The still
+    //   valid code units will be written separately.
     // - The format_to_n function is finished. In this case there's no need to
     //   switch the buffer, but for simplicity the buffers are still switched.
-    // When the __n <= 0 the constructor already switched the buffers.
+    // When the __max_size <= 0 the constructor already switched the buffers.
     if (__size_ == 0 && __ptr != __storage_.begin()) {
       __writer_.flush(__ptr, __size);
       __output_.reset(__storage_.begin(), __storage_.__buffer_size);
+    } else if (__size_ < __max_size_) {
+      // Copies a part of the internal buffer to the output up to n characters.
+      // See __output_buffer<_CharT>::__flush_on_overflow for more information.
+      _Size __s = _VSTD::min(_Size(__size), __max_size_ - __size_);
+      std::copy_n(__ptr, __s, __writer_.out());
+      __writer_.flush(__ptr, __s);
     }
 
     __size_ += __size;
@@ -338,6 +477,7 @@ class _LIBCPP_TEMPLATE_VIS __format_to_n_buffer_base<_OutIt, _CharT, true> {
   __output_buffer<_CharT> __output_;
   __writer_direct<_OutIt, _CharT> __writer_;
 
+  _Size __max_size_;
   _Size __size_{0};
 };
 
@@ -350,7 +490,8 @@ struct _LIBCPP_TEMPLATE_VIS __format_to_n_buffer final
   using _Size = iter_difference_t<_OutIt>;
 
 public:
-  _LIBCPP_HIDE_FROM_ABI explicit __format_to_n_buffer(_OutIt __out_it, _Size __n) : _Base(_VSTD::move(__out_it), __n) {}
+  _LIBCPP_HIDE_FROM_ABI explicit __format_to_n_buffer(_OutIt __out_it, _Size __max_size)
+      : _Base(_VSTD::move(__out_it), __max_size) {}
   _LIBCPP_HIDE_FROM_ABI auto make_output_iterator() { return this->__output_.make_output_iterator(); }
 
   _LIBCPP_HIDE_FROM_ABI format_to_n_result<_OutIt> result() && {

diff --git a/libcxx/include/__format/formatter_floating_point.h b/libcxx/include/__format/formatter_floating_point.h
index 16be89347b8ac..3a65ed436defc 100644
--- a/libcxx/include/__format/formatter_floating_point.h
+++ b/libcxx/include/__format/formatter_floating_point.h
@@ -10,9 +10,7 @@
 #ifndef _LIBCPP___FORMAT_FORMATTER_FLOATING_POINT_H
 #define _LIBCPP___FORMAT_FORMATTER_FLOATING_POINT_H
 
-#include <__algorithm/copy.h>
 #include <__algorithm/copy_n.h>
-#include <__algorithm/fill_n.h>
 #include <__algorithm/find.h>
 #include <__algorithm/min.h>
 #include <__algorithm/rotate.h>
@@ -528,13 +526,13 @@ _LIBCPP_HIDE_FROM_ABI _OutIt __format_locale_specific_form(
   // sign and (zero padding or alignment)
   if (__zero_padding && __first != __buffer.begin())
     *__out_it++ = *__buffer.begin();
-  __out_it = _VSTD::fill_n(_VSTD::move(__out_it), __padding.__before_, __specs.__fill_);
+  __out_it = __formatter::__fill(_VSTD::move(__out_it), __padding.__before_, __specs.__fill_);
   if (!__zero_padding && __first != __buffer.begin())
     *__out_it++ = *__buffer.begin();
 
   // integral part
   if (__grouping.empty()) {
-    __out_it = _VSTD::copy_n(__first, __digits, _VSTD::move(__out_it));
+    __out_it = __formatter::__copy(__first, __digits, _VSTD::move(__out_it));
   } else {
     auto __r = __grouping.rbegin();
     auto __e = __grouping.rend() - 1;
@@ -546,7 +544,7 @@ _LIBCPP_HIDE_FROM_ABI _OutIt __format_locale_specific_form(
     // This loop achieves that process by testing the termination condition
     // midway in the loop.
     while (true) {
-      __out_it = _VSTD::copy_n(__first, *__r, _VSTD::move(__out_it));
+      __out_it = __formatter::__copy(__first, *__r, _VSTD::move(__out_it));
       __first += *__r;
 
       if (__r == __e)
@@ -560,16 +558,16 @@ _LIBCPP_HIDE_FROM_ABI _OutIt __format_locale_specific_form(
   // fractional part
   if (__result.__radix_point != __result.__last) {
     *__out_it++ = __np.decimal_point();
-    __out_it = _VSTD::copy(__result.__radix_point + 1, __result.__exponent, _VSTD::move(__out_it));
-    __out_it = _VSTD::fill_n(_VSTD::move(__out_it), __buffer.__num_trailing_zeros(), _CharT('0'));
+    __out_it    = __formatter::__copy(__result.__radix_point + 1, __result.__exponent, _VSTD::move(__out_it));
+    __out_it    = __formatter::__fill(_VSTD::move(__out_it), __buffer.__num_trailing_zeros(), _CharT('0'));
   }
 
   // exponent
   if (__result.__exponent != __result.__last)
-    __out_it = _VSTD::copy(__result.__exponent, __result.__last, _VSTD::move(__out_it));
+    __out_it = __formatter::__copy(__result.__exponent, __result.__last, _VSTD::move(__out_it));
 
   // alignment
-  return _VSTD::fill_n(_VSTD::move(__out_it), __padding.__after_, __specs.__fill_);
+  return __formatter::__fill(_VSTD::move(__out_it), __padding.__after_, __specs.__fill_);
 }
 #  endif // _LIBCPP_HAS_NO_LOCALIZATION
 
@@ -651,14 +649,15 @@ __format_floating_point(_Tp __value, auto& __ctx, __format_spec::__parsed_specif
   if (__size + __num_trailing_zeros >= __specs.__width_) {
     if (__num_trailing_zeros && __result.__exponent != __result.__last)
       // Insert trailing zeros before exponent character.
-      return _VSTD::copy(
+      return __formatter::__copy(
           __result.__exponent,
           __result.__last,
-          _VSTD::fill_n(
-              _VSTD::copy(__buffer.begin(), __result.__exponent, __ctx.out()), __num_trailing_zeros, _CharT('0')));
+          __formatter::__fill(__formatter::__copy(__buffer.begin(), __result.__exponent, __ctx.out()),
+                              __num_trailing_zeros,
+                              _CharT('0')));
 
-    return _VSTD::fill_n(
-        _VSTD::copy(__buffer.begin(), __result.__last, __ctx.out()), __num_trailing_zeros, _CharT('0'));
+    return __formatter::__fill(
+        __formatter::__copy(__buffer.begin(), __result.__last, __ctx.out()), __num_trailing_zeros, _CharT('0'));
   }
 
   auto __out_it = __ctx.out();

diff --git a/libcxx/include/__format/formatter_integral.h b/libcxx/include/__format/formatter_integral.h
index b9ed5fe80f7f3..834a402081aa6 100644
--- a/libcxx/include/__format/formatter_integral.h
+++ b/libcxx/include/__format/formatter_integral.h
@@ -243,7 +243,7 @@ _LIBCPP_HIDE_FROM_ABI auto __format_integer(
     // The zero padding is done like:
     // - Write [sign][prefix]
     // - Write data right aligned with '0' as fill character.
-    __out_it             = _VSTD::copy(__begin, __first, _VSTD::move(__out_it));
+    __out_it             = __formatter::__copy(__begin, __first, _VSTD::move(__out_it));
     __specs.__alignment_ = __format_spec::__alignment::__right;
     __specs.__fill_      = _CharT('0');
     int32_t __size       = __first - __begin;

diff --git a/libcxx/include/__format/formatter_output.h b/libcxx/include/__format/formatter_output.h
index e09534c41dff0..1852c88ea4fb2 100644
--- a/libcxx/include/__format/formatter_output.h
+++ b/libcxx/include/__format/formatter_output.h
@@ -14,10 +14,13 @@
 #include <__algorithm/copy_n.h>
 #include <__algorithm/fill_n.h>
 #include <__algorithm/transform.h>
+#include <__concepts/same_as.h>
 #include <__config>
+#include <__format/buffer.h>
 #include <__format/formatter.h>
 #include <__format/parser_std_format_spec.h>
 #include <__format/unicode.h>
+#include <__iterator/back_insert_iterator.h>
 #include <__utility/move.h>
 #include <__utility/unreachable.h>
 #include <cstddef>
@@ -86,6 +89,63 @@ __padding_size(size_t __size, size_t __width, __format_spec::__alignment __align
   __libcpp_unreachable();
 }
 
+/// Copy wrapper.
+///
+/// This uses a "mass output function" of __format::__output_buffer when possible.
+template <__formatter::__char_type _CharT, __formatter::__char_type _OutCharT = _CharT>
+_LIBCPP_HIDE_FROM_ABI auto __copy(basic_string_view<_CharT> __str, output_iterator<const _OutCharT&> auto __out_it)
+    -> decltype(__out_it) {
+  if constexpr (_VSTD::same_as<decltype(__out_it), _VSTD::back_insert_iterator<__format::__output_buffer<_OutCharT>>>) {
+    __out_it.__get_container()->__copy(__str);
+    return __out_it;
+  } else {
+    return std::copy_n(__str.data(), __str.size(), _VSTD::move(__out_it));
+  }
+}
+
+template <__formatter::__char_type _CharT, __formatter::__char_type _OutCharT = _CharT>
+_LIBCPP_HIDE_FROM_ABI auto
+__copy(const _CharT* __first, const _CharT* __last, output_iterator<const _OutCharT&> auto __out_it)
+    -> decltype(__out_it) {
+  return __formatter::__copy(basic_string_view{__first, __last}, _VSTD::move(__out_it));
+}
+
+template <__formatter::__char_type _CharT, __formatter::__char_type _OutCharT = _CharT>
+_LIBCPP_HIDE_FROM_ABI auto __copy(const _CharT* __first, size_t __n, output_iterator<const _OutCharT&> auto __out_it)
+    -> decltype(__out_it) {
+  return __formatter::__copy(basic_string_view{__first, __n}, _VSTD::move(__out_it));
+}
+
+/// Transform wrapper.
+///
+/// This uses a "mass output function" of __format::__output_buffer when possible.
+template <__formatter::__char_type _CharT, __formatter::__char_type _OutCharT = _CharT, class _UnaryOperation>
+_LIBCPP_HIDE_FROM_ABI auto
+__transform(const _CharT* __first,
+            const _CharT* __last,
+            output_iterator<const _OutCharT&> auto __out_it,
+            _UnaryOperation __operation) -> decltype(__out_it) {
+  if constexpr (_VSTD::same_as<decltype(__out_it), _VSTD::back_insert_iterator<__format::__output_buffer<_OutCharT>>>) {
+    __out_it.__get_container()->__transform(__first, __last, _VSTD::move(__operation));
+    return __out_it;
+  } else {
+    return std::transform(__first, __last, _VSTD::move(__out_it), __operation);
+  }
+}
+
+/// Fill wrapper.
+///
+/// This uses a "mass output function" of __format::__output_buffer when possible.
+template <__formatter::__char_type _CharT, output_iterator<const _CharT&> _OutIt>
+_LIBCPP_HIDE_FROM_ABI _OutIt __fill(_OutIt __out_it, size_t __n, _CharT __value) {
+  if constexpr (_VSTD::same_as<decltype(__out_it), _VSTD::back_insert_iterator<__format::__output_buffer<_CharT>>>) {
+    __out_it.__get_container()->__fill(__n, __value);
+    return __out_it;
+  } else {
+    return std::fill_n(_VSTD::move(__out_it), __n, __value);
+  }
+}
+
 template <class _OutIt, class _CharT>
 _LIBCPP_HIDE_FROM_ABI _OutIt __write_using_decimal_separators(_OutIt __out_it, const char* __begin, const char* __first,
                                                               const char* __last, string&& __grouping, _CharT __sep,
@@ -97,22 +157,22 @@ _LIBCPP_HIDE_FROM_ABI _OutIt __write_using_decimal_separators(_OutIt __out_it, c
   __padding_size_result __padding = {0, 0};
   if (__specs.__alignment_ == __format_spec::__alignment::__zero_padding) {
     // Write [sign][prefix].
-    __out_it = _VSTD::copy(__begin, __first, _VSTD::move(__out_it));
+    __out_it = __formatter::__copy(__begin, __first, _VSTD::move(__out_it));
 
     if (__specs.__width_ > __size) {
       // Write zero padding.
       __padding.__before_ = __specs.__width_ - __size;
-      __out_it = _VSTD::fill_n(_VSTD::move(__out_it), __specs.__width_ - __size, _CharT('0'));
+      __out_it            = __formatter::__fill(_VSTD::move(__out_it), __specs.__width_ - __size, _CharT('0'));
     }
   } else {
     if (__specs.__width_ > __size) {
       // Determine padding and write padding.
       __padding = __padding_size(__size, __specs.__width_, __specs.__alignment_);
 
-      __out_it = _VSTD::fill_n(_VSTD::move(__out_it), __padding.__before_, __specs.__fill_);
+      __out_it = __formatter::__fill(_VSTD::move(__out_it), __padding.__before_, __specs.__fill_);
     }
     // Write [sign][prefix].
-    __out_it = _VSTD::copy(__begin, __first, _VSTD::move(__out_it));
+    __out_it = __formatter::__copy(__begin, __first, _VSTD::move(__out_it));
   }
 
   auto __r = __grouping.rbegin();
@@ -133,10 +193,10 @@ _LIBCPP_HIDE_FROM_ABI _OutIt __write_using_decimal_separators(_OutIt __out_it, c
   while (true) {
     if (__specs.__std_.__type_ == __format_spec::__type::__hexadecimal_upper_case) {
       __last = __first + *__r;
-      __out_it = _VSTD::transform(__first, __last, _VSTD::move(__out_it), __hex_to_upper);
+      __out_it = __formatter::__transform(__first, __last, _VSTD::move(__out_it), __hex_to_upper);
       __first = __last;
     } else {
-      __out_it = _VSTD::copy_n(__first, *__r, _VSTD::move(__out_it));
+      __out_it = __formatter::__copy(__first, *__r, _VSTD::move(__out_it));
       __first += *__r;
     }
 
@@ -147,7 +207,7 @@ _LIBCPP_HIDE_FROM_ABI _OutIt __write_using_decimal_separators(_OutIt __out_it, c
     *__out_it++ = __sep;
   }
 
-  return _VSTD::fill_n(_VSTD::move(__out_it), __padding.__after_, __specs.__fill_);
+  return __formatter::__fill(_VSTD::move(__out_it), __padding.__after_, __specs.__fill_);
 }
 
 /// Writes the input to the output with the required padding.
@@ -155,12 +215,10 @@ _LIBCPP_HIDE_FROM_ABI _OutIt __write_using_decimal_separators(_OutIt __out_it, c
 /// Since the output column width is specified the function can be used for
 /// ASCII and Unicode output.
 ///
-/// \pre [\a __first, \a __last) is a valid range.
 /// \pre \a __size <= \a __width. Using this function when this pre-condition
 ///      doesn't hold incurs an unwanted overhead.
 ///
-/// \param __first     Pointer to the first element to write.
-/// \param __last      Pointer beyond the last element to write.
+/// \param __str       The string to write.
 /// \param __out_it    The output iterator to write to.
 /// \param __specs     The parsed formatting specifications.
 /// \param __size      The (estimated) output column width. When the elements
@@ -174,31 +232,42 @@ _LIBCPP_HIDE_FROM_ABI _OutIt __write_using_decimal_separators(_OutIt __out_it, c
 /// conversion, which means the [\a __first, \a __last) always contains elements
 /// of the type \c char.
 template <class _CharT, class _ParserCharT>
-_LIBCPP_HIDE_FROM_ABI auto __write(
-    const _CharT* __first,
-    const _CharT* __last,
-    output_iterator<const _CharT&> auto __out_it,
-    __format_spec::__parsed_specifications<_ParserCharT> __specs,
-    ptrdiff_t __size) -> decltype(__out_it) {
-  _LIBCPP_ASSERT(__first <= __last, "Not a valid range");
-
+_LIBCPP_HIDE_FROM_ABI auto
+__write(basic_string_view<_CharT> __str,
+        output_iterator<const _CharT&> auto __out_it,
+        __format_spec::__parsed_specifications<_ParserCharT> __specs,
+        ptrdiff_t __size) -> decltype(__out_it) {
   if (__size >= __specs.__width_)
-    return _VSTD::copy(__first, __last, _VSTD::move(__out_it));
+    return __formatter::__copy(__str, _VSTD::move(__out_it));
 
   __padding_size_result __padding = __formatter::__padding_size(__size, __specs.__width_, __specs.__std_.__alignment_);
-  __out_it = _VSTD::fill_n(_VSTD::move(__out_it), __padding.__before_, __specs.__fill_);
-  __out_it = _VSTD::copy(__first, __last, _VSTD::move(__out_it));
-  return _VSTD::fill_n(_VSTD::move(__out_it), __padding.__after_, __specs.__fill_);
+  __out_it                        = __formatter::__fill(_VSTD::move(__out_it), __padding.__before_, __specs.__fill_);
+  __out_it                        = __formatter::__copy(__str, _VSTD::move(__out_it));
+  return __formatter::__fill(_VSTD::move(__out_it), __padding.__after_, __specs.__fill_);
+}
+
+template <class _CharT, class _ParserCharT>
+_LIBCPP_HIDE_FROM_ABI auto
+__write(const _CharT* __first,
+        const _CharT* __last,
+        output_iterator<const _CharT&> auto __out_it,
+        __format_spec::__parsed_specifications<_ParserCharT> __specs,
+        ptrdiff_t __size) -> decltype(__out_it) {
+  _LIBCPP_ASSERT(__first <= __last, "Not a valid range");
+  return __formatter::__write(basic_string_view{__first, __last}, _VSTD::move(__out_it), __specs, __size);
 }
 
 /// \overload
 ///
 /// Calls the function above where \a __size = \a __last - \a __first.
 template <class _CharT, class _ParserCharT>
-_LIBCPP_HIDE_FROM_ABI auto __write(const _CharT* __first, const _CharT* __last,
-                                   output_iterator<const _CharT&> auto __out_it,
-                                   __format_spec::__parsed_specifications<_ParserCharT> __specs) -> decltype(__out_it) {
-  return __write(__first, __last, _VSTD::move(__out_it), __specs, __last - __first);
+_LIBCPP_HIDE_FROM_ABI auto
+__write(const _CharT* __first,
+        const _CharT* __last,
+        output_iterator<const _CharT&> auto __out_it,
+        __format_spec::__parsed_specifications<_ParserCharT> __specs) -> decltype(__out_it) {
+  _LIBCPP_ASSERT(__first <= __last, "Not a valid range");
+  return __formatter::__write(__first, __last, _VSTD::move(__out_it), __specs, __last - __first);
 }
 
 template <class _CharT, class _ParserCharT, class _UnaryOperation>
@@ -210,12 +279,12 @@ _LIBCPP_HIDE_FROM_ABI auto __write_transformed(const _CharT* __first, const _Cha
 
   ptrdiff_t __size = __last - __first;
   if (__size >= __specs.__width_)
-    return _VSTD::transform(__first, __last, _VSTD::move(__out_it), __op);
+    return __formatter::__transform(__first, __last, _VSTD::move(__out_it), __op);
 
   __padding_size_result __padding = __padding_size(__size, __specs.__width_, __specs.__alignment_);
-  __out_it = _VSTD::fill_n(_VSTD::move(__out_it), __padding.__before_, __specs.__fill_);
-  __out_it = _VSTD::transform(__first, __last, _VSTD::move(__out_it), __op);
-  return _VSTD::fill_n(_VSTD::move(__out_it), __padding.__after_, __specs.__fill_);
+  __out_it                        = __formatter::__fill(_VSTD::move(__out_it), __padding.__before_, __specs.__fill_);
+  __out_it                        = __formatter::__transform(__first, __last, _VSTD::move(__out_it), __op);
+  return __formatter::__fill(_VSTD::move(__out_it), __padding.__after_, __specs.__fill_);
 }
 
 /// Writes additional zero's for the precision before the exponent.
@@ -240,11 +309,11 @@ _LIBCPP_HIDE_FROM_ABI auto __write_using_trailing_zeros(
 
   __padding_size_result __padding =
       __padding_size(__size + __num_trailing_zeros, __specs.__width_, __specs.__alignment_);
-  __out_it = _VSTD::fill_n(_VSTD::move(__out_it), __padding.__before_, __specs.__fill_);
-  __out_it = _VSTD::copy(__first, __exponent, _VSTD::move(__out_it));
-  __out_it = _VSTD::fill_n(_VSTD::move(__out_it), __num_trailing_zeros, _CharT('0'));
-  __out_it = _VSTD::copy(__exponent, __last, _VSTD::move(__out_it));
-  return _VSTD::fill_n(_VSTD::move(__out_it), __padding.__after_, __specs.__fill_);
+  __out_it = __formatter::__fill(_VSTD::move(__out_it), __padding.__before_, __specs.__fill_);
+  __out_it = __formatter::__copy(__first, __exponent, _VSTD::move(__out_it));
+  __out_it = __formatter::__fill(_VSTD::move(__out_it), __num_trailing_zeros, _CharT('0'));
+  __out_it = __formatter::__copy(__exponent, __last, _VSTD::move(__out_it));
+  return __formatter::__fill(_VSTD::move(__out_it), __padding.__after_, __specs.__fill_);
 }
 
 /// Writes a string using format's width estimation algorithm.
@@ -262,7 +331,7 @@ _LIBCPP_HIDE_FROM_ABI auto __write_string_no_precision(
 
   // No padding -> copy the string
   if (!__specs.__has_width())
-    return _VSTD::copy(__str.begin(), __str.end(), _VSTD::move(__out_it));
+    return __formatter::__copy(__str, _VSTD::move(__out_it));
 
   // Note when the estimated width is larger than size there's no padding. So
   // there's no reason to get the real size when the estimate is larger than or
@@ -270,8 +339,7 @@ _LIBCPP_HIDE_FROM_ABI auto __write_string_no_precision(
   size_t __size =
       __format_spec::__estimate_column_width(__str, __specs.__width_, __format_spec::__column_width_rounding::__up)
           .__width_;
-
-  return __formatter::__write(__str.begin(), __str.end(), _VSTD::move(__out_it), __specs, __size);
+  return __formatter::__write(__str, _VSTD::move(__out_it), __specs, __size);
 }
 
 template <class _CharT>

diff  --git a/libcxx/test/std/utilities/format/format.formatter/format.formatter.spec/formatter.unsigned_integral.pass.cpp b/libcxx/test/std/utilities/format/format.formatter/format.formatter.spec/formatter.unsigned_integral.pass.cpp
index c3e426fcba1cc..36f2dbd4b8b48 100644
--- a/libcxx/test/std/utilities/format/format.formatter/format.formatter.spec/formatter.unsigned_integral.pass.cpp
+++ b/libcxx/test/std/utilities/format/format.formatter/format.formatter.spec/formatter.unsigned_integral.pass.cpp
@@ -88,6 +88,8 @@ void test_unsigned_integral_type() {
     test_termination_condition(
         STR("340282366920938463463374607431768211455"), STR("}"), A(std::numeric_limits<__uint128_t>::max()));
 #endif
+  // Test __formatter::__transform (libc++ specific).
+  test_termination_condition(STR("FF"), STR("X}"), A(255));
 }
 
 template <class CharT>

diff  --git a/libcxx/test/std/utilities/format/format.functions/format_tests.h b/libcxx/test/std/utilities/format/format.functions/format_tests.h
index ba6f67b0e24fd..551e1dd066a99 100644
--- a/libcxx/test/std/utilities/format/format.functions/format_tests.h
+++ b/libcxx/test/std/utilities/format/format.functions/format_tests.h
@@ -2557,6 +2557,68 @@ void format_test_pointer(TestFunction check, ExceptionTest check_exception) {
   format_test_pointer<const void*, CharT>(check, check_exception);
 }
 
+/// Tests special buffer functions with a "large" input.
+///
+/// This is a test specific for libc++, however the code should behave the same
+/// on all implementations.
+/// In \c __format::__output_buffer there are some special functions to optimize
+/// outputting multiple characters, \c __copy, \c __transform, \c __fill. This
+/// test validates whether the functions behave properly when the output size
+/// doesn't fit in its internal buffer.
+template <class CharT, class TestFunction>
+void format_test_buffer_optimizations(TestFunction check) {
+#ifdef _LIBCPP_VERSION
+  // Used to validate our test sets are the proper size.
+  // To test the chunked operations it needs to be larger than the internal
+  // buffer. Picked a nice looking number.
+  constexpr int minimum = 3 * std::__format::__internal_storage<CharT>::__buffer_size;
+#else
+  constexpr int minimum = 1;
+#endif
+
+  // Copy
+  std::basic_string<CharT> str = STR(
+      "The quick brown fox jumps over the lazy dog."
+      "The quick brown fox jumps over the lazy dog."
+      "The quick brown fox jumps over the lazy dog."
+      "The quick brown fox jumps over the lazy dog."
+      "The quick brown fox jumps over the lazy dog."
+      "The quick brown fox jumps over the lazy dog."
+      "The quick brown fox jumps over the lazy dog."
+      "The quick brown fox jumps over the lazy dog."
+      "The quick brown fox jumps over the lazy dog."
+      "The quick brown fox jumps over the lazy dog."
+      "The quick brown fox jumps over the lazy dog."
+      "The quick brown fox jumps over the lazy dog."
+      "The quick brown fox jumps over the lazy dog."
+      "The quick brown fox jumps over the lazy dog."
+      "The quick brown fox jumps over the lazy dog."
+      "The quick brown fox jumps over the lazy dog."
+      "The quick brown fox jumps over the lazy dog."
+      "The quick brown fox jumps over the lazy dog."
+      "The quick brown fox jumps over the lazy dog."
+      "The quick brown fox jumps over the lazy dog."
+      "The quick brown fox jumps over the lazy dog."
+      "The quick brown fox jumps over the lazy dog."
+      "The quick brown fox jumps over the lazy dog."
+      "The quick brown fox jumps over the lazy dog."
+      "The quick brown fox jumps over the lazy dog."
+      "The quick brown fox jumps over the lazy dog."
+      "The quick brown fox jumps over the lazy dog."
+      "The quick brown fox jumps over the lazy dog."
+      "The quick brown fox jumps over the lazy dog."
+      "The quick brown fox jumps over the lazy dog.");
+  assert(str.size() > minimum);
+  check.template operator()<"{}">(std::basic_string_view<CharT>{str}, str);
+
+  // Fill
+  std::basic_string<CharT> fill(minimum, CharT('*'));
+  check.template operator()<"{:*<{}}">(std::basic_string_view<CharT>{str + fill}, str, str.size() + minimum);
+  check.template operator()<"{:*^{}}">(
+      std::basic_string_view<CharT>{fill + str + fill}, str, minimum + str.size() + minimum);
+  check.template operator()<"{:*>{}}">(std::basic_string_view<CharT>{fill + str}, str, minimum + str.size());
+}
+
 template <class CharT, class TestFunction, class ExceptionTest>
 void format_tests(TestFunction check, ExceptionTest check_exception) {
   // *** Test escaping  ***
@@ -2671,6 +2733,9 @@ void format_tests(TestFunction check, ExceptionTest check_exception) {
 
   // *** Test handle formatter argument ***
   format_test_handle<CharT>(check, check_exception);
+
+  // *** Test the internal buffer optimizations ***
+  format_test_buffer_optimizations<CharT>(check);
 }
 
 #ifndef TEST_HAS_NO_WIDE_CHARACTERS