[libcxx-commits] [PATCH] D80827: Optimize vector push_back to avoid continuous load and store of end pointer. (alt 2)

Fri May 29 12:00:38 PDT 2020

mvels created this revision.
Herald added a project: libc++.
Herald added a subscriber: libcxx-commits.
Herald added a reviewer: libc++.
mvels retitled this revision from "credits: this change is based on analysis and a proof of concept by gerbens at google.com." to "Optimize vector push_back to avoid continuous load and store of end pointer. (alt 2)".
mvels edited the summary of this revision.
mvels added a reviewer: EricWF.

credits: this change is based on analysis and a proof of concept by gerbens at google.com.

Before, the compiler loses track of end as 'this' and other references possibly escape beyond the compiler's scope. This can be see in the generated assembly:

  16.28 │200c80:   mov     %r15d,(%rax)
  60.87 │200c83:   add     $0x4,%rax
        │200c87:   mov     %rax,-0x38(%rbp)
   0.03 │200c8b: → jmpq    200d4e
   ...
   ...
   1.69 │200d4e:   cmp     %r15d,%r12d
        │200d51: → je      200c40
  16.34 │200d57:   inc     %r15d
   0.05 │200d5a:   mov     -0x38(%rbp),%rax
   3.27 │200d5e:   mov     -0x30(%rbp),%r13
   1.47 │200d62:   cmp     %r13,%rax
        │200d65: → jne     200c80

We fix this by always explicitly storing the loaded local and pointer back at the end of push back. This generates some slight source 'noise' in the slow paths, but creates nice and compact fast path code, i.e.:

  32.64 │200760:   mov    %r14d,(%r12)
   9.97 │200764:   add    $0x4,%r12
   6.97 │200768:   mov    %r12,-0x38(%rbp)
  32.17 │20076c:   add    $0x1,%r14d
   2.36 │200770:   cmp    %r14d,%ebx
        │200773: → je     200730
   8.98 │200775:   mov    -0x30(%rbp),%r13
   6.75 │200779:   cmp    %r13,%r12
        │20077c: → jne    200760

Now there is a single store for the pushback value (as before), and a single store for the end without a reload (dependency).
For full local vectors, (i.e., not referenced elsewhere), the capacity load and store inside the loop could also be removed, but this requires more substantial refactoring inside vector.

Benchmark based on:

  template <typename Vector>
  void BM_Pushback(benchmark::State& state) {
    int count = state.range(0);
    Vector vec;
    vec.reserve(count);
    while (state.KeepRunningBatch(count)) {
      vec.clear();
      for (int i = 0; i < count; ++i) {
        vec.push_back(i);
      }
      testing::DoNotOptimize(vec.data());
    }
  }

  BENCHMARK(BM_Pushback<std::vector<int>>)->Arg(4000);

Results:

  CPU: Intel Haswell with HyperThreading (32 cores) dL1:32KB dL2:256KB dL3:45MB
  name                                         old time/op             new time/op   delta
  -----------------------------------------------------------------------------------------
  BM_Pushback<std::vector<int>>/4k             1.61ns ± 0%             0.61ns ± 3%  -62.36%

Repository:
  rG LLVM Github Monorepo

https://reviews.llvm.org/D80827

Files:
  libcxx/include/__config
  libcxx/include/vector

-------------- next part --------------
A non-text attachment was scrubbed...
Name: D80827.267314.patch
Type: text/x-patch
Size: 5239 bytes
Desc: not available
URL: <http://lists.llvm.org/pipermail/libcxx-commits/attachments/20200529/55a6c24a/attachment-0001.bin>