[libcxx-commits] [PATCH] D80588: Optimize vector push_back to avoid continuous load and store of end pointer.

Wed May 27 10:17:18 PDT 2020

mvels added a comment.

> 1. Pass `this->__end_` to the helper functions
> 2. Return the new end pointer from those helpers
> 
>   I am skeptical why step 2 is needed at all. You never remove setting of `this->__end_`. So why do you need to do work that has already been done? Could you please verify that the second part is indeed necessary?
> 
>   If it is indeed necessary I note that you pessimize the slow path by decrementing and then incrementing.
> 
>   I would greatly prefer it if you would directly return in both the fast and the slow path.

The story is somewhat complicated. The compiler will optimize local types and keep them register allocated if possible. The 'if possible' here is not absolute, but more an 'if the compiler deems it possible".
As the main loop here is mostly trivial, and the vector has 3 words for state, we can easily 'see' it is possible to keep the vector state in 3 registers (if we include begin_).
For a compiler this is harder, as the following factors come in:

- the logic involves some non inlined code that is beyond the compilers view and may affect this / state
- state is modified at an inlining depth the compiler no longer tracks in full (tracking state is hard)
- 'this' or other state 'escapes', i.e., some code path could escape into globals, functions, etc, and the compiler can't proof that the state of the vector is not externally observable
- compiler heuristics required to determine that the slow path is unlikely and/or register allocation is easily preserved
- etc....

For an example of possible variants of making life 'as easy as possible' for the compiler, see https://pastebin.com/5YjHbSaC

Running these benchmarks:

  BM_Pushback<Vector<1>>/4k         1.67ns ± 8%
  BM_Pushback<Vector<2>>/4k         0.60ns ± 7%
  BM_Pushback<Vector<3>>/4k         0.35ns ± 7%

You'll see the top 2 are basically the numbers I posted earlier. The 3rd option is rewriting the logic such that the slow path is executed without 'this' state, but purely as inputs / outputs:

  void push_back(int value) {
      pointer end = end_;
      if (end == end_cap_) {
        size_type sz = end_ - begin_;
        size_type n = sz ? sz * 2 : 2;
        begin_ = __push_back_slow_path(begin_, sz, n, value);
        end_ = end = begin_ + sz;
        end_cap_ = begin_ + n;
      } else {
        *end = value;
        end_ = end + 1;
      }
  }

Which now has the fast path completely register allocated:

  20.20 │ ca:   cmp   %r13,%r14
             │     ↑ je    90
  20.16 │       mov   %r15d,(%r14)
  19.26 │       add   $0x4,%r14
  21.36 │       add   $0x1,%r15d
             │       cmp   %r15d,%r12d
  18.63 │     ↑ jne   ca

I cheated here in 2 ways: I elided allocator state. The default std::allocator is stateless, however, for a generic implementation we do need to pass allocator references along the call paths. This is part where refactoring this is harder, as this needs to be factored out in 'default allocator' and 'stateful allocator' code where the latter basically will have option 2 performance (only caching end_ state in register). Additionally, there is -fno-exception which makes optimizing this easier (there are no thousands of early exits) especially when it comes to setting / swapping state on grow events.

> Finally, your `__construct_one_at` helper function diverges from the previous code.
> 
> - The ASAN annotation is new and not done in the slow path.
> - `__construct_at_end` has the additional `_ConstructTransaction __tx(*this, 1);` I am new to the party but I am suspicious of exception safety here in case of a throwing constructor

The Transaction class purpose is to track the 'end' pointer (in the split buffer in these cases) for the added element, and to do some accounting (size grew) used in ASAN compilation which defines how many items are readable.
The 'construct one at end' case is easy, as there is only one failure point -> in place constructing the last element, so there is no need for any complicated state. Thus we can simply construct in place, and do the 'grow count for asan' which happens in the tx dtor.

Repository:
  rG LLVM Github Monorepo

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D80588/new/

https://reviews.llvm.org/D80588