[PATCH] D112175: [NFC] Add llvm::StaticVector ADT

Duncan P. N. Exon Smith via Phabricator via llvm-commits llvm-commits at lists.llvm.org
Fri Apr 29 01:30:53 PDT 2022


dexonsmith added a comment.

In D112175#3481680 <https://reviews.llvm.org/D112175#3481680>, @jplayer-nv wrote:

>>   [ RUN      ] StaticVsSmall/0.TimedStaticVector
>>   [       OK ] StaticVsSmall/0.TimedStaticVector (371 ms)
>>   [ RUN      ] StaticVsSmall/0.TimedSmallVector
>>   [       OK ] StaticVsSmall/0.TimedSmallVector (596 ms)
>>
>> Which demonstrates that simple insertions take almost 38% less time with `StaticVector` than with a `SmallVector` that stays within its small buffer.
>
> Let me amend this a bit... it looks like `SmallVector::push_back` is slower in this case because of insufficient specialization.  Here's the trivially copyable version:
>
>   void push_back(ValueParamT Elt) {
>     const T *EltPtr = reserveForParamAndGetAddress(Elt);
>     memcpy(reinterpret_cast<void *>(this->end()), EltPtr, sizeof(T));
>     this->set_size(this->size() + 1);
>   }
>
> When `Elt` is passed by value, `reserveForParamAndGetAddress()` is unnecessary since the parameter cannot alias the `SmallVector`'s buffer, and the extra call inhibits host-compiler optimization.  If we SFINAE-split this function, we can match my `StaticVector::push_back` perf.  The following `SmallVector` change works for me locally:
>
>     template <typename RetT = std::enable_if_t<TakesParamByValue>>
>     RetT push_back_impl(value_type Elt) {
>       if (LLVM_UNLIKELY(this->size() >= this->capacity()))
>         this->grow();
>       memcpy(reinterpret_cast<void *>(this->end()), std::addressof(Elt),
>              sizeof(T));
>       this->set_size(this->size() + 1);
>     }
>   
>     template <typename RetT = std::enable_if_t<!TakesParamByValue>>
>     RetT push_back_impl(const_reference Elt) {
>       const T *EltPtr = reserveForParamAndGetAddress(Elt);
>       memcpy(reinterpret_cast<void *>(this->end()), EltPtr, sizeof(T));
>       this->set_size(this->size() + 1);
>     }
>   
>   public:
>     void push_back(ValueParamT Elt) { push_back_impl(Elt); }

Nice find.

When we added this assertion, we ran the compile-time suite and didn't see regressions. The theory was that the optimizer *should* have enough information to get to the same place on its own, since it can see this code in `reserveForParamAndGetAddress`:

  size_t NewSize = This->size() + N;
  if (LLVM_LIKELY(NewSize <= This->capacity()))
    return &Elt;
  
  bool ReferencesStorage = false;
  int64_t Index = -1;
  if (!U::TakesParamByValue) {
    if (LLVM_UNLIKELY(This->isReferenceToStorage(&Elt))) {
      ReferencesStorage = true;
      Index = &Elt - This->begin();
    }
  }
  This->grow(NewSize);
  return ReferencesStorage ? This->begin() + Index : &Elt;

Ideally the outer `if` would be `if constexpr (!U::TakesParamByValue)`, but that needs to wait for C++17.
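
For reference, here's roughly how that could read once C++17 is available; a minimal sketch based on the snippet above, not a tested patch:

  // Hypothetical C++17 form: when the parameter is taken by value, the whole
  // branch is discarded at compile time, so the optimizer no longer has to
  // prove the aliasing check away on its own.
  if constexpr (!U::TakesParamByValue) {
    if (LLVM_UNLIKELY(This->isReferenceToStorage(&Elt))) {
      ReferencesStorage = true;
      Index = &Elt - This->begin();
    }
  }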

The current implementation is easy to verify since the check lives in just one place. Is it only a synthetic benchmark where this causes a slowdown, or can we help the optimizer along somehow without changing the layering? (I don't feel strongly attached to the exact design of the current thing, but IIRC it took some compromise at the time to make everyone happy. I could be remembering the patch series incorrectly, though.)
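
For context, the reason the by-reference path needs the check at all is that the argument can alias the vector's own storage, in which case `grow()` would invalidate it; a by-value parameter is already a copy, so there is nothing to invalidate. A minimal illustration (not from the patch; `std::string` is just an example of an element type that is taken by const reference):

  #include "llvm/ADT/SmallVector.h"
  #include <string>

  void selfInsert(llvm::SmallVector<std::string, 4> &V) {
    // Assuming V is non-empty: if V is also full, push_back must grow the
    // buffer. Without the isReferenceToStorage handling, the reference bound
    // to V[0] would dangle across the reallocation; with it, the element is
    // copied from its relocated address instead.
    V.push_back(V[0]);
  }

For a by-value element type (e.g. `int`), the same call has no aliasing hazard, which is exactly the case the proposed split targets.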


Repository:
  rG LLVM Github Monorepo

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D112175/new/

https://reviews.llvm.org/D112175


