[libcxx-commits] [PATCH] D82809: Optimize vector push_back for hot loops / push_back fill invocations

Mon Jun 29 14:46:11 PDT 2020

mvels created this revision.
mvels added a reviewer: EricWF.
Herald added a project: libc++.
Herald added a subscriber: libcxx-commits.
Herald added a reviewer: libc++.

credits: this change is based on analysis and a proof of concept by gerbens at google.com.

With the old code, the compiler does not generate optimized code for inlined / compact code, i.e.:

  vector<int> myvec;
  for (...) {
    myvec.push_back(<Some Inlined Expression>);
  }

Before, the compiler loses track of the internal `end` and `end capacity` pointers as 'this' and other references possibly escape beyond the compiler's scope. This can be seen in the generated assembly:

  16.28 │200c80:   mov     %r15d,(%rax)
  60.87 │200c83:   add     $0x4,%rax
        │200c87:   mov     %rax,-0x38(%rbp)
   0.03 │200c8b: → jmpq    200d4e
   ...
   ...
   1.69 │200d4e:   cmp     %r15d,%r12d
        │200d51: → je      200c40
  16.34 │200d57:   inc     %r15d
   0.05 │200d5a:   mov     -0x38(%rbp),%rax
   3.27 │200d5e:   mov     -0x30(%rbp),%r13
   1.47 │200d62:   cmp     %r13,%rax
        │200d65: → jne     200c80

The compiler stores and reloads the `end` and `end_cap` pointers.

We fix this by always explicitly storing the loaded local and pointer back at the end of push back. This generates some slight source 'noise' in the slow paths, but creates nice and compact fast path code where the compiler now cashes the new `end` values, i.e.:

  32.64 │200760:   mov    %r14d,(%r12)
   9.97 │200764:   add    $0x4,%r12
   6.97 │200768:   mov    %r12,-0x38(%rbp)
  32.17 │20076c:   add    $0x1,%r14d
   2.36 │200770:   cmp    %r14d,%ebx
        │200773: → je     200730
   8.98 │200775:   mov    -0x30(%rbp),%r13
   6.75 │200779:   cmp    %r13,%r12
        │20077c: → jne    200760

Now there is a single store for the pushback value (as before), and a single store for the end without a reload (dependency).

For full local vectors, (i.e., not referenced elsewhere nor 'escaped`), the capacity load and store inside the loop may also be fully elided depending on the compiler and usage of the vector as per the below benchmark with a local vector, where a vector on some pod is then as fast as filling a plain array, i.e., one store per push_back:

  20.93 │ a0:  →mov    %r14d,0x0(%r13)
  19.29 │       add    $0x4,%r13
  18.34 │       add    $0x1,%r14d
  18.86 │       cmp    %r14d,%r15d
        │     ↓ je     190
  21.78 │ b5:   cmp    %rsi,%r13
        │       jne    a0

Benchmark based on:

  void BM_FillVectorPushback(benchmark::State& state) {
    int count = state.range(0);
    std::vector<int> vec;
    vec.reserve(count);
    while (state.KeepRunningBatch(count)) {
      vec.clear();
      for (int i = 0; i < count; ++i) {
        vec.push_back(i);
      }
      testing::DoNotOptimize(vec.data());
    }
  }

  BENCHMARK(BM_FillVectorPushback)->Arg(4000);

Results:

  name                                       old time/op             new time/op    delta
  ---------------------------------------------------------------------------------------
  BM_FillVectorPushback/1000                 1.74ns ± 6%             0.39ns ± 1%  -77.75%

Repository:
  rG LLVM Github Monorepo

https://reviews.llvm.org/D82809

Files:
  libcxx/include/vector

-------------- next part --------------
A non-text attachment was scrubbed...
Name: D82809.274241.patch
Type: text/x-patch
Size: 13407 bytes
Desc: not available
URL: <http://lists.llvm.org/pipermail/libcxx-commits/attachments/20200629/8a45eca2/attachment.bin>