[PATCH] D91157: [AArch64] Out-of-line atomics (-moutline-atomics) implementation.
Sebastian Pop via Phabricator via cfe-commits
cfe-commits at lists.llvm.org
Sat Dec 5 18:03:57 PST 2020
sebpop added a comment.
I tested this change on Graviton2 aarch64-linux by building https://github.com/xianyi/OpenBLAS with `clang -O3 -moutline-atomics` and `make test`: all tests pass with and without outline-atomics.
Clang was configured to use libgcc.
I also tested https://github.com/boostorg/boost.git with and without -moutline-atomics, and there are no new failures.
Here is how I built and ran the tests for boost:
git clone --recursive https://github.com/boostorg/boost.git $HOME/boost
cd $HOME/boost
mkdir usr
./bootstrap.sh --prefix=$HOME/boost/usr
# in project-config.jam line 12
# replace `using gcc ;` with `using clang : : $HOME/llvm-project/usr/bin/clang++ ;`
./b2 --build-type=complete --layout=versioned -a
cd status
../b2 # runs all regression tests
I also looked at the performance of some atomic operations using google-benchmark on an Ubuntu 20.04 c6g instance with Graviton2 (Neoverse-N1).
Performance is better when using LSE instructions compared to generic armv8-a code: with g++, an atomic increment takes 5.21 ns with -march=armv8-a+lse versus 7.21 ns without.
The overhead of -moutline-atomics is negligible compared to armv8-a+lse: 5.60 ns for the same increment with g++, and no measurable difference with clang.
clang trunk as of today produces slightly slower code than gcc-9 with and without -moutline-atomics.
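For readers unfamiliar with the mechanism, here is a minimal sketch of the runtime-dispatch idea behind -moutline-atomics. It is not the libgcc or compiler-rt code: the names `have_lse` and `outlined_fetch_add` are hypothetical, and both branches use portable builtins so the sketch builds on any target. In the real helpers the first branch would be a single LSE instruction and the second a load-exclusive/store-exclusive loop.

```cpp
// Hypothetical stand-in for the flag the runtime sets once at startup
// after probing the CPU for LSE support.
static bool have_lse = false;

// Hypothetical out-of-line helper: one call target, two code paths.
// noinline keeps it a real call, as with -moutline-atomics.
__attribute__((noinline))
int outlined_fetch_add(int *ptr, int val) {
  if (have_lse) {
    // Real helper: a single LSE atomic add. Modeled with the builtin.
    return __atomic_fetch_add(ptr, val, __ATOMIC_SEQ_CST);
  }
  // Real helper: an LL/SC retry loop. Modeled with a compare-exchange loop;
  // on failure `old` is refreshed with the current value and we retry.
  int old = __atomic_load_n(ptr, __ATOMIC_RELAXED);
  while (!__atomic_compare_exchange_n(ptr, &old, old + val, /*weak=*/true,
                                      __ATOMIC_SEQ_CST, __ATOMIC_RELAXED)) {
  }
  return old;  // value before the add, like fetch_add
}
```

The extra cost over inline LSE code is the call plus one load and branch on the flag, which is consistent with the small gap between the lse and moutline runs.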
$ cat a.cc
#include <benchmark/benchmark.h>
#include <atomic>

std::atomic<int> i;
static void BM_atomic_increment(benchmark::State& state) {
  for (auto _ : state)
    benchmark::DoNotOptimize(i++);
}
BENCHMARK(BM_atomic_increment);

int j;
static void BM_atomic_fetch_add(benchmark::State& state) {
  for (auto _ : state)
    benchmark::DoNotOptimize(__atomic_fetch_add(&j, 1, __ATOMIC_SEQ_CST));
}
BENCHMARK(BM_atomic_fetch_add);

int k;
static void BM_atomic_compare_exchange(benchmark::State& state) {
  for (auto _ : state)
    benchmark::DoNotOptimize(__atomic_compare_exchange
                             (&j, &k, &k, 1, __ATOMIC_ACQUIRE, __ATOMIC_ACQUIRE));
}
BENCHMARK(BM_atomic_compare_exchange);

template<class T>
struct node {
  T data;
  node* next;
  node(const T& data) : data(data), next(nullptr) {}
};

static void BM_std_atomic_compare_exchange(benchmark::State& state) {
  node<int>* new_node = new node<int>(42);
  std::atomic<node<int>*> head;
  for (auto _ : state)
    benchmark::DoNotOptimize(std::atomic_compare_exchange_weak_explicit
                             (&head, &new_node->next, new_node,
                              std::memory_order_release,
                              std::memory_order_relaxed));
}
BENCHMARK(BM_std_atomic_compare_exchange);

BENCHMARK_MAIN();
---
$ ./go.sh
+ g++ -o generic-v8 a.cc -std=c++11 -O2 -isystem benchmark/include -Lbenchmark/build/src -lbenchmark -lpthread
+ ./generic-v8
2020-12-06 01:06:26
Running ./generic-v8
Run on (64 X 243.75 MHz CPU s)
CPU Caches:
L1 Data 64 KiB (x64)
L1 Instruction 64 KiB (x64)
L2 Unified 1024 KiB (x64)
L3 Unified 32768 KiB (x1)
Load Average: 64.36, 59.36, 36.41
***WARNING*** Library was built as DEBUG. Timings may be affected.
-------------------------------------------------------------------------
Benchmark                              Time             CPU    Iterations
-------------------------------------------------------------------------
BM_atomic_increment                 7.21 ns         7.20 ns      97116662
BM_atomic_fetch_add                 7.20 ns         7.20 ns      97152394
BM_atomic_compare_exchange          7.71 ns         7.71 ns      90780423
BM_std_atomic_compare_exchange      7.61 ns         7.61 ns      92037159
+ /home/ubuntu/llvm-project/nin/bin/clang++ -o clang-generic-v8 a.cc -std=c++11 -O2 -isystem benchmark/include -Lbenchmark/build/src -lbenchmark -lpthread
+ ./clang-generic-v8
2020-12-06 01:06:30
Running ./clang-generic-v8
Run on (64 X 243.75 MHz CPU s)
CPU Caches:
L1 Data 64 KiB (x64)
L1 Instruction 64 KiB (x64)
L2 Unified 1024 KiB (x64)
L3 Unified 32768 KiB (x1)
Load Average: 64.57, 59.49, 36.57
***WARNING*** Library was built as DEBUG. Timings may be affected.
-------------------------------------------------------------------------
Benchmark                              Time             CPU    Iterations
-------------------------------------------------------------------------
BM_atomic_increment                 9.21 ns         9.21 ns      75989223
BM_atomic_fetch_add                 9.21 ns         9.21 ns      76031211
BM_atomic_compare_exchange          7.61 ns         7.61 ns      92012620
BM_std_atomic_compare_exchange      12.4 ns         12.4 ns      56421424
+ g++ -o lse -march=armv8-a+lse a.cc -std=c++11 -O2 -isystem benchmark/include -Lbenchmark/build/src -lbenchmark -lpthread
+ ./lse
2020-12-06 01:06:34
Running ./lse
Run on (64 X 243.75 MHz CPU s)
CPU Caches:
L1 Data 64 KiB (x64)
L1 Instruction 64 KiB (x64)
L2 Unified 1024 KiB (x64)
L3 Unified 32768 KiB (x1)
Load Average: 64.85, 59.63, 36.74
***WARNING*** Library was built as DEBUG. Timings may be affected.
-------------------------------------------------------------------------
Benchmark                              Time             CPU    Iterations
-------------------------------------------------------------------------
BM_atomic_increment                 5.21 ns         5.21 ns     134201945
BM_atomic_fetch_add                 5.21 ns         5.21 ns     134438848
BM_atomic_compare_exchange          6.80 ns         6.80 ns     102872012
BM_std_atomic_compare_exchange      6.80 ns         6.80 ns     102864719
+ clang++ -o clang-lse -march=armv8-a+lse a.cc -std=c++11 -O2 -isystem benchmark/include -Lbenchmark/build/src -lbenchmark -lpthread
+ ./clang-lse
2020-12-06 01:06:38
Running ./clang-lse
Run on (64 X 243.75 MHz CPU s)
CPU Caches:
L1 Data 64 KiB (x64)
L1 Instruction 64 KiB (x64)
L2 Unified 1024 KiB (x64)
L3 Unified 32768 KiB (x1)
Load Average: 64.85, 59.63, 36.74
***WARNING*** Library was built as DEBUG. Timings may be affected.
-------------------------------------------------------------------------
Benchmark                              Time             CPU    Iterations
-------------------------------------------------------------------------
BM_atomic_increment                 7.21 ns         7.21 ns      97086511
BM_atomic_fetch_add                 7.21 ns         7.21 ns      97152416
BM_atomic_compare_exchange          7.20 ns         7.20 ns      97186161
BM_std_atomic_compare_exchange      11.6 ns         11.6 ns      60302378
+ g++ -o moutline -moutline-atomics a.cc -std=c++11 -O2 -isystem benchmark/include -Lbenchmark/build/src -lbenchmark -lpthread
+ ./moutline
2020-12-06 01:06:41
Running ./moutline
Run on (64 X 243.75 MHz CPU s)
CPU Caches:
L1 Data 64 KiB (x64)
L1 Instruction 64 KiB (x64)
L2 Unified 1024 KiB (x64)
L3 Unified 32768 KiB (x1)
Load Average: 64.94, 59.74, 36.90
***WARNING*** Library was built as DEBUG. Timings may be affected.
-------------------------------------------------------------------------
Benchmark                              Time             CPU    Iterations
-------------------------------------------------------------------------
BM_atomic_increment                 5.60 ns         5.60 ns     124853685
BM_atomic_fetch_add                 5.60 ns         5.60 ns     124907943
BM_atomic_compare_exchange          7.21 ns         7.21 ns      97151664
BM_std_atomic_compare_exchange      7.21 ns         7.21 ns      97148224
+ /home/ubuntu/llvm-project/nin/bin/clang++ -o clang-moutline -moutline-atomics a.cc -std=c++11 -O2 -isystem benchmark/include -Lbenchmark/build/src -lbenchmark -lpthread
+ ./clang-moutline
2020-12-06 01:06:45
Running ./clang-moutline
Run on (64 X 243.75 MHz CPU s)
CPU Caches:
L1 Data 64 KiB (x64)
L1 Instruction 64 KiB (x64)
L2 Unified 1024 KiB (x64)
L3 Unified 32768 KiB (x1)
Load Average: 64.95, 59.82, 37.05
***WARNING*** Library was built as DEBUG. Timings may be affected.
-------------------------------------------------------------------------
Benchmark                              Time             CPU    Iterations
-------------------------------------------------------------------------
BM_atomic_increment                 7.21 ns         7.21 ns      97071465
BM_atomic_fetch_add                 7.21 ns         7.20 ns      97150580
BM_atomic_compare_exchange          7.20 ns         7.20 ns      97164566
BM_std_atomic_compare_exchange      11.6 ns         11.6 ns      60301778
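To put a number on the -moutline-atomics overhead, the CPU times from the g++ lse and moutline runs above can be turned into relative differences. The sketch below only redoes that arithmetic; the helper name `percent_slower` is mine, not part of any benchmark tooling.

```cpp
// Hypothetical helper: how much slower `t` is than `base`, in percent.
// Both arguments are times in the same unit (here ns per iteration).
double percent_slower(double t, double base) {
  return (t - base) / base * 100.0;
}

// g++ CPU times copied from the runs above (ns per iteration).
const double lse_fetch_add = 5.21, moutline_fetch_add = 5.60;
const double lse_cas = 6.80, moutline_cas = 7.21;

// Relative cost of the out-of-line call versus inline LSE code.
const double fetch_add_overhead =
    percent_slower(moutline_fetch_add, lse_fetch_add);
const double cas_overhead = percent_slower(moutline_cas, lse_cas);
```

With these inputs the g++ overhead works out to roughly 7.5% for fetch_add and 6% for compare-exchange, a fraction of a nanosecond per operation. For clang the lse and moutline timings are identical to within measurement noise.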
Repository:
rG LLVM Github Monorepo
CHANGES SINCE LAST ACTION
https://reviews.llvm.org/D91157/new/
https://reviews.llvm.org/D91157