[PATCH] D91157: [AArch64] Out-of-line atomics (-moutline-atomics) implementation.

Sat Dec 5 18:03:57 PST 2020

sebpop added a comment.

I tested this change on a Graviton2 (aarch64-linux) machine by building https://github.com/xianyi/OpenBLAS with `clang -O3 -moutline-atomics` and running `make test`: all tests pass both with and without -moutline-atomics.
Clang was configured to use libgcc.

I also tested https://github.com/boostorg/boost.git with and without -moutline-atomics, and there are no new failures.
Here is how I built and ran the tests for boost:

  git clone --recursive https://github.com/boostorg/boost.git $HOME/boost
  cd $HOME/boost
  mkdir usr
  ./bootstrap.sh --prefix=$HOME/boost/usr
  # in project-config.jam line 12
  # replace `using gcc ;` with `using clang :   : $HOME/llvm-project/usr/bin/clang++ ;`
  ./b2 --build-type=complete --layout=versioned -a
  cd status
  ../b2  # runs all regression tests
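
The manual edit of project-config.jam in the steps above could also be scripted. The following is a minimal sketch rather than what was actually run: the helper name `use_clang_toolset` is made up for illustration, and it assumes the file contains the literal text `using gcc ;` as described in the comment above.

```shell
# Hypothetical helper (not from the original post): rewrite the toolset line
# in a Boost project-config.jam so that b2 uses the given clang++ binary.
use_clang_toolset() {
  jam_file=$1   # path to project-config.jam
  compiler=$2   # path to clang++; must not contain '|', '&' or '\'
  # Write to a temporary file and rename it, which avoids relying on the
  # non-portable in-place option of sed.
  sed "s|using gcc ;|using clang :   : ${compiler} ;|" "$jam_file" > "$jam_file.tmp" &&
    mv "$jam_file.tmp" "$jam_file"
}
```

With the paths used above, this would be invoked from $HOME/boost as `use_clang_toolset project-config.jam "$HOME/llvm-project/usr/bin/clang++"`.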

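I also looked at the performance of some atomic operations using google-benchmark on an Ubuntu 20.04 c6g instance with Graviton2 (Neoverse-N1).
Performance is better when using LSE instructions compared to generic armv8-a code: with gcc, for example, fetch_add drops from 7.20 ns to 5.21 ns.
The overhead of -moutline-atomics compared to armv8-a+lse is small: about 0.4 ns per operation with gcc (5.60 ns vs. 5.21 ns for fetch_add) and none measurable with clang.
clang trunk as of today generally produces slower code than gcc-9, with and without -moutline-atomics; the gap is largest for std::atomic_compare_exchange_weak_explicit (11.6 ns vs. 7.21 ns with -moutline-atomics).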
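
Since -moutline-atomics selects between LSE and load/store-exclusive sequences at run time, it helps to know whether the machine running the benchmark advertises LSE at all. The following is a minimal sketch, assuming an AArch64 Linux kernel that lists the capability as `atomics` in the Features line of /proc/cpuinfo; the helper name `has_lse` is hypothetical.

```shell
# Hypothetical helper (not from the original post): succeed if the CPU
# advertises LSE atomics, reported by AArch64 Linux as "atomics" in the
# Features line of /proc/cpuinfo. An alternative file can be passed as $1.
has_lse() {
  cpuinfo=${1:-/proc/cpuinfo}
  # Look only at the first Features line and match "atomics" as a whole word.
  grep -m1 '^Features' "$cpuinfo" | grep -qw atomics
}
```

For example, `has_lse && echo "LSE available"` would be expected to print the message on the Graviton2 instance used here. Disassembling the binaries (for instance with `objdump -d`) should likewise show calls to helpers whose names start with `__aarch64_` in the -moutline-atomics build and inline LSE instructions such as `ldaddal` in the armv8-a+lse build, although this was not checked against the binaries measured here.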

  $ cat a.cc
  #include <benchmark/benchmark.h>
  #include <atomic>

  std::atomic<int> i;
  static void BM_atomic_increment(benchmark::State& state) {
    for (auto _ : state)
      benchmark::DoNotOptimize(i++);
  }
  BENCHMARK(BM_atomic_increment);

  int j;
  static void BM_atomic_fetch_add(benchmark::State& state) {
    for (auto _ : state)
      benchmark::DoNotOptimize(__atomic_fetch_add(&j, 1, __ATOMIC_SEQ_CST));
  }
  BENCHMARK(BM_atomic_fetch_add);

  int k;
  static void BM_atomic_compare_exchange(benchmark::State& state) {
    for (auto _ : state)
      benchmark::DoNotOptimize(__atomic_compare_exchange
                               (&j, &k, &k, 1, __ATOMIC_ACQUIRE, __ATOMIC_ACQUIRE));
  }
  BENCHMARK(BM_atomic_compare_exchange);

  template<class T>
  struct node {
    T data;
    node* next;
    node(const T& data) : data(data), next(nullptr) {}
  };

  static void BM_std_atomic_compare_exchange(benchmark::State& state) {
    node<int>* new_node = new node<int>(42);
    std::atomic<node<int>*> head(nullptr);
    for (auto _ : state)
      benchmark::DoNotOptimize(std::atomic_compare_exchange_weak_explicit
                               (&head, &new_node->next, new_node,
                                std::memory_order_release,
                                std::memory_order_relaxed));
  }
  BENCHMARK(BM_std_atomic_compare_exchange);

  BENCHMARK_MAIN();

  ---
  $ ./go.sh
  + g++ -o generic-v8 a.cc -std=c++11 -O2 -isystem benchmark/include -Lbenchmark/build/src -lbenchmark -lpthread
  + ./generic-v8
  2020-12-06 01:06:26
  Running ./generic-v8
  Run on (64 X 243.75 MHz CPU s)
  CPU Caches:
    L1 Data 64 KiB (x64)
    L1 Instruction 64 KiB (x64)
    L2 Unified 1024 KiB (x64)
    L3 Unified 32768 KiB (x1)
  Load Average: 64.36, 59.36, 36.41
  ***WARNING*** Library was built as DEBUG. Timings may be affected.
  -------------------------------------------------------------------------
  Benchmark                               Time             CPU   Iterations
  -------------------------------------------------------------------------
  BM_atomic_increment                  7.21 ns         7.20 ns     97116662
  BM_atomic_fetch_add                  7.20 ns         7.20 ns     97152394
  BM_atomic_compare_exchange           7.71 ns         7.71 ns     90780423
  BM_std_atomic_compare_exchange       7.61 ns         7.61 ns     92037159
  + /home/ubuntu/llvm-project/nin/bin/clang++ -o clang-generic-v8 a.cc -std=c++11 -O2 -isystem benchmark/include -Lbenchmark/build/src -lbenchmark -lpthread
  + ./clang-generic-v8
  2020-12-06 01:06:30
  Running ./clang-generic-v8
  Run on (64 X 243.75 MHz CPU s)
  CPU Caches:
    L1 Data 64 KiB (x64)
    L1 Instruction 64 KiB (x64)
    L2 Unified 1024 KiB (x64)
    L3 Unified 32768 KiB (x1)
  Load Average: 64.57, 59.49, 36.57
  ***WARNING*** Library was built as DEBUG. Timings may be affected.
  -------------------------------------------------------------------------
  Benchmark                               Time             CPU   Iterations
  -------------------------------------------------------------------------
  BM_atomic_increment                  9.21 ns         9.21 ns     75989223
  BM_atomic_fetch_add                  9.21 ns         9.21 ns     76031211
  BM_atomic_compare_exchange           7.61 ns         7.61 ns     92012620
  BM_std_atomic_compare_exchange       12.4 ns         12.4 ns     56421424
  + g++ -o lse -march=armv8-a+lse a.cc -std=c++11 -O2 -isystem benchmark/include -Lbenchmark/build/src -lbenchmark -lpthread
  + ./lse
  2020-12-06 01:06:34
  Running ./lse
  Run on (64 X 243.75 MHz CPU s)
  CPU Caches:
    L1 Data 64 KiB (x64)
    L1 Instruction 64 KiB (x64)
    L2 Unified 1024 KiB (x64)
    L3 Unified 32768 KiB (x1)
  Load Average: 64.85, 59.63, 36.74
  ***WARNING*** Library was built as DEBUG. Timings may be affected.
  -------------------------------------------------------------------------
  Benchmark                               Time             CPU   Iterations
  -------------------------------------------------------------------------
  BM_atomic_increment                  5.21 ns         5.21 ns    134201945
  BM_atomic_fetch_add                  5.21 ns         5.21 ns    134438848
  BM_atomic_compare_exchange           6.80 ns         6.80 ns    102872012
  BM_std_atomic_compare_exchange       6.80 ns         6.80 ns    102864719
  + clang++ -o clang-lse -march=armv8-a+lse a.cc -std=c++11 -O2 -isystem benchmark/include -Lbenchmark/build/src -lbenchmark -lpthread
  + ./clang-lse
  2020-12-06 01:06:38
  Running ./clang-lse
  Run on (64 X 243.75 MHz CPU s)
  CPU Caches:
    L1 Data 64 KiB (x64)
    L1 Instruction 64 KiB (x64)
    L2 Unified 1024 KiB (x64)
    L3 Unified 32768 KiB (x1)
  Load Average: 64.85, 59.63, 36.74
  ***WARNING*** Library was built as DEBUG. Timings may be affected.
  -------------------------------------------------------------------------
  Benchmark                               Time             CPU   Iterations
  -------------------------------------------------------------------------
  BM_atomic_increment                  7.21 ns         7.21 ns     97086511
  BM_atomic_fetch_add                  7.21 ns         7.21 ns     97152416
  BM_atomic_compare_exchange           7.20 ns         7.20 ns     97186161
  BM_std_atomic_compare_exchange       11.6 ns         11.6 ns     60302378
  + g++ -o moutline -moutline-atomics a.cc -std=c++11 -O2 -isystem benchmark/include -Lbenchmark/build/src -lbenchmark -lpthread
  + ./moutline
  2020-12-06 01:06:41
  Running ./moutline
  Run on (64 X 243.75 MHz CPU s)
  CPU Caches:
    L1 Data 64 KiB (x64)
    L1 Instruction 64 KiB (x64)
    L2 Unified 1024 KiB (x64)
    L3 Unified 32768 KiB (x1)
  Load Average: 64.94, 59.74, 36.90
  ***WARNING*** Library was built as DEBUG. Timings may be affected.
  -------------------------------------------------------------------------
  Benchmark                               Time             CPU   Iterations
  -------------------------------------------------------------------------
  BM_atomic_increment                  5.60 ns         5.60 ns    124853685
  BM_atomic_fetch_add                  5.60 ns         5.60 ns    124907943
  BM_atomic_compare_exchange           7.21 ns         7.21 ns     97151664
  BM_std_atomic_compare_exchange       7.21 ns         7.21 ns     97148224
  + /home/ubuntu/llvm-project/nin/bin/clang++ -o clang-moutline -moutline-atomics a.cc -std=c++11 -O2 -isystem benchmark/include -Lbenchmark/build/src -lbenchmark -lpthread
  + ./clang-moutline
  2020-12-06 01:06:45
  Running ./clang-moutline
  Run on (64 X 243.75 MHz CPU s)
  CPU Caches:
    L1 Data 64 KiB (x64)
    L1 Instruction 64 KiB (x64)
    L2 Unified 1024 KiB (x64)
    L3 Unified 32768 KiB (x1)
  Load Average: 64.95, 59.82, 37.05
  ***WARNING*** Library was built as DEBUG. Timings may be affected.
  -------------------------------------------------------------------------
  Benchmark                               Time             CPU   Iterations
  -------------------------------------------------------------------------
  BM_atomic_increment                  7.21 ns         7.21 ns     97071465
  BM_atomic_fetch_add                  7.21 ns         7.20 ns     97150580
  BM_atomic_compare_exchange           7.20 ns         7.20 ns     97164566
  BM_std_atomic_compare_exchange       11.6 ns         11.6 ns     60301778
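
The go.sh driver itself is not shown above; only its traced commands are. The following is a sketch of how such a driver might look, reconstructed from the `+` lines in the log. The function names and the `GXX`/`CLANGXX` variables are assumptions, and the log shows the clang-lse build using plain `clang++` rather than the full path used for the other clang builds.

```shell
# Sketch of a benchmark driver reconstructed from the traced commands in the
# log; the original go.sh was not posted, so the structure is an assumption.
GXX=${GXX:-g++}
CLANGXX=${CLANGXX:-/home/ubuntu/llvm-project/nin/bin/clang++}

# build_and_run COMPILER OUTPUT [EXTRA_FLAGS...]
# Compiles a.cc against google-benchmark, then runs the resulting binary.
build_and_run() {
  cxx=$1
  out=$2
  shift 2
  "$cxx" -o "$out" "$@" a.cc -std=c++11 -O2 -isystem benchmark/include \
    -Lbenchmark/build/src -lbenchmark -lpthread &&
    "./$out"
}

# Builds and runs the six variants in the same order as the log.
run_all() {
  build_and_run "$GXX"     generic-v8 &&
  build_and_run "$CLANGXX" clang-generic-v8 &&
  build_and_run "$GXX"     lse -march=armv8-a+lse &&
  build_and_run "$CLANGXX" clang-lse -march=armv8-a+lse &&
  build_and_run "$GXX"     moutline -moutline-atomics &&
  build_and_run "$CLANGXX" clang-moutline -moutline-atomics
}
```

Calling `run_all` from the directory that holds a.cc and the benchmark checkout would build and run all six variants; nothing is executed until it is called.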

Repository:
  rG LLVM Github Monorepo

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D91157/new/

https://reviews.llvm.org/D91157