[libcxx-commits] [PATCH] D152208: [libc++] Introduce __make_uninitialized_buffer and use it instead of get_temporary_buffer

Mon Jun 26 11:50:29 PDT 2023

mingmingl added inline comments.

================
Comment at: libcxx/include/__memory/uninitialized_buffer.h:71-73
+#else
+  _Tp* __ptr = static_cast<_Tp*>(::operator new(sizeof(_Tp) * __count, align_val_t(_LIBCPP_ALIGNOF(_Tp)), nothrow));
+#endif
----------------
philnik wrote:
> mingmingl wrote:
> > philnik wrote:
> > > mingmingl wrote:
> > > > Hi, a regression is observed in a benchmark that uses `std::stable_sort` intensively. 
> > > > 
> > > > Comparing baseline and the regressed case, the baseline (using `get_temporary_buffer`) uses unaligned alloc ( `void* operator new[](size_t size) `) when objects are not over aligned, and the regressed uses aligned alloc for all objects. This may increase the memory usage as well.
> > > > 
> > > > Given `std::stable_sort` is pretty important, I wonder if this regression warrants a fix to have consistent usage of (aligned or unaligned) new operator?
> > > I don't understand how using the aligned alloc would increase memory usage. `void* operator new[](size_t size)` is basically `void* operator new[](size_t size, align_val_t{__STDCPP_DEFAULT_NEW_ALIGNMENT__})`. So unless you have a bad implementation it shouldn't make any difference.
> > thanks for the quick response! 
> > 
> > To share more context, the malloc implementation is tcmalloc (https://github.com/google/tcmalloc); and aligned alloc exercises more instructions than unaligned one. Considering the affected library is `stable_sort`, the benefit of keeping unaligned alloc is not small.
> > 
> > - In the baseline, unaligned alloc uses `void* operator new[](size_t size) noexcept(false)` corresponds to `TCMallocInternalNewArray` (an alias of `TCMallocInternalNew`), the implementation is https://github.com/google/tcmalloc/blob/139e709eefd5bc6a301868f752cb2d8931d47d45/tcmalloc/tcmalloc.cc#L1134
> > 
> > - In the regressed case, `operator new[](unsigned long, std::align_val_t, std::nothrow_t const&)` is used -> `TCMallocInternalNewArrayAlignedNothrow` is used. The implementation is https://github.com/google/tcmalloc/blob/139e709eefd5bc6a301868f752cb2d8931d47d45/tcmalloc/tcmalloc.cc#L1195
> > 
> > Comparing the `llvm-objdump` output below, aligned operator new has a longer instruction sequence (0x154 for regressed case, 0xec for baseline) -> this increases the number of instructions executed on a hot path (`std::stable_sort`).
> > 
> > 
> > 
> > ```
> > 000000000b4702c0 g     F section-name	00000000000000ec tcmalloc_size_returning_operator_new   # new operator used in BASELINE
> > 000000000b4703c0 g     F section-name	0000000000000154 tcmalloc_size_returning_operator_new_aligned
> > 000000000b470540 g     F section-name	00000000000001ec tcmalloc_size_returning_operator_new_hot_cold
> > 000000000b470740 g     F section-name	00000000000002ac tcmalloc_size_returning_operator_new_aligned_hot_cold
> > 000000000b470a00 g     F section-name 00000000000000ec tcmalloc_size_returning_operator_new_nothrow
> > 000000000b470b00 g     F section-name	0000000000000154 tcmalloc_size_returning_operator_new_aligned_nothrow # new operator in EXPERIMENT
> > 000000000b470c80 g     F section-name	00000000000001ec tcmalloc_size_returning_operator_new_hot_cold_nothrow
> > 000000000b470e80 g     F section-name	00000000000002ac tcmalloc_size_returning_operator_new_aligned_hot_cold_nothrow
> > ```
> How exactly does that correlate to memory use? I have no idea how TCMalloc is implemented, so your links don't really help me. It could just be the case that the aligned path isn't as well optimized as the non-aligned one, resulting in slightly worse performance. I don't see any theoretical reason the aligned path should be any slower than the non-aligned one. Worst-case should be a single condition to see whether the non-aligned path should be taken if it's actually faster.
> How exactly does that correlate to memory use?
I consulted tcmalloc experts and got the confirmation that this won't increase memory usage; however, using aligned alloc where it's not necessary is going to slow down the execution. I'm not very familiar with the implementation either and misread how alignment affects the size class calculation; sorry about the wrong statement on memory increase.

In the regressed case, aligned operator new does extra work to achieve the same results. To put it the other way, the unaligned operator new can use a compile time constant in its calculations, and saves a runtime cost to choose a size class. 
- [This 10-line code snippet](https://github.com/google/tcmalloc/blob/139e709eefd5bc6a301868f752cb2d8931d47d45/tcmalloc/sizemap.h#L223-L232) and the function should hopefully roughly explain where the additional runtime cost comes from when aligned alloc is used (without a full implementation detail)

> Worst-case should be a single condition to see whether the non-aligned path should be taken if it's actually faster.

At a large scale, a single extra if statement here adds up across multiple binaries since `stable_sort` is used widely. 

Could the original unaligned new operator be added for parity?

Repository:
  rG LLVM Github Monorepo

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D152208/new/

https://reviews.llvm.org/D152208