[all-commits] [llvm/llvm-project] 90569e: [Support] Add Arm NEON implementation for `llvm::x...

Mon Jul 22 10:07:04 PDT 2024

  Branch: refs/heads/main
  Home:   https://github.com/llvm/llvm-project
  Commit: 90569e02e63ff5d0915446919f564e9b3638fe2a
      https://github.com/llvm/llvm-project/commit/90569e02e63ff5d0915446919f564e9b3638fe2a
  Author: Daniel Bertalan <dani at danielbertalan.dev>
  Date:   2024-07-22 (Mon, 22 Jul 2024)

  Changed paths:
    M llvm/benchmarks/CMakeLists.txt
    A llvm/benchmarks/xxhash.cpp
    M llvm/lib/Support/xxhash.cpp

  Log Message:
  -----------
  [Support] Add Arm NEON implementation for `llvm::xxh3_64bits` (#99634)

Compared to the generic scalar code, using Arm NEON instructions yields
a ~11x speedup: 31 ms (NEON) vs 339.5 ms (scalar) to hash 1 GiB of
random data on the Apple M1.

This follows the upstream implementation closely, with some
simplifications made:
- Removed workarounds for suboptimal codegen on older GCC
- Removed instruction reordering barriers, which seem to have a
negligible impact according to my measurements
- We do not support WebAssembly's mostly NEON-compatible API
- There is no configurable mixing of SIMD and scalar code; according to
the upstream comments, this is only relevant for smaller Cortex cores
which can dispatch relatively few NEON micro-ops per cycle.
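
As a rough illustration of the scalar-vs-NEON split described above, here
is a minimal sketch of the XXH3 accumulate step over a single 64-byte
stripe. It is not the code from this commit: the names
`accumulateStripeScalar`, `accumulateStripeNeon`, `readLane` and
`SKETCH_HAS_NEON` are hypothetical, lanes are read in host byte order
(little-endian is assumed), and the barriers and scalar/SIMD mixing that
upstream xxHash offers are left out, as in the list above.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>

// Hypothetical guard for this sketch only; the commit uses its own macro.
#if defined(__aarch64__) && defined(__ARM_NEON) && !defined(__ARM_BIG_ENDIAN)
#include <arm_neon.h>
#define SKETCH_HAS_NEON 1
#else
#define SKETCH_HAS_NEON 0
#endif

// Read one 64-bit lane in host byte order (assumed little-endian here).
static uint64_t readLane(const uint8_t *p) {
  uint64_t v;
  std::memcpy(&v, p, sizeof(v));
  return v;
}

// Scalar accumulate over one 64-byte stripe: 8 lanes of 64 bits each.
void accumulateStripeScalar(uint64_t acc[8], const uint8_t *input,
                            const uint8_t *secret) {
  for (size_t i = 0; i < 8; ++i) {
    uint64_t dataVal = readLane(input + 8 * i);
    uint64_t dataKey = dataVal ^ readLane(secret + 8 * i);
    acc[i ^ 1] += dataVal; // raw input is added to the neighbouring lane
    acc[i] += (dataKey & 0xFFFFFFFFu) * (dataKey >> 32); // 32x32 -> 64 bits
  }
}

#if SKETCH_HAS_NEON
// The same step with ACLE intrinsics, two 64-bit lanes per iteration.
void accumulateStripeNeon(uint64_t acc[8], const uint8_t *input,
                          const uint8_t *secret) {
  for (size_t i = 0; i < 4; ++i) {
    uint64x2_t accVec = vld1q_u64(acc + 2 * i);
    uint64x2_t dataVec = vreinterpretq_u64_u8(vld1q_u8(input + 16 * i));
    uint64x2_t keyVec = vreinterpretq_u64_u8(vld1q_u8(secret + 16 * i));
    uint64x2_t dataKey = veorq_u64(dataVec, keyVec);
    uint64x2_t swapped = vextq_u64(dataVec, dataVec, 1); // swap the lane pair
    uint32x2_t lo = vmovn_u64(dataKey);       // low 32 bits of each lane
    uint32x2_t hi = vshrn_n_u64(dataKey, 32); // high 32 bits of each lane
    accVec = vaddq_u64(accVec, swapped);
    accVec = vmlal_u32(accVec, lo, hi); // acc += lo * hi, widened to 64 bits
    vst1q_u64(acc + 2 * i, accVec);
  }
}
#endif
```

Both functions are meant to produce identical accumulators for the same
input; the NEON variant only compiles when targeting little-endian
AArch64.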

This commit intends to use only standard ACLE intrinsics and datatypes,
so it should build with all supported versions of GCC, Clang and MSVC.

This feature is enabled by default when targeting AArch64, but the
`LLVM_XXH_USE_NEON=0` macro can be set to explicitly disable it.
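
A compile-time switch of this kind can be sketched as follows. Only the
macro name `LLVM_XXH_USE_NEON` comes from this commit; the exact detection
condition used in `llvm/lib/Support/xxhash.cpp` may differ, and the helper
functions `xxhUsesNeon` and `xxhBackendName` are hypothetical, added here
only to make the selected path observable.

```cpp
// Default the macro only if the build did not define it already, so that
// passing -DLLVM_XXH_USE_NEON=0 on the command line always wins.
#ifndef LLVM_XXH_USE_NEON
#if (defined(__aarch64__) || defined(_M_ARM64)) && !defined(__ARM_BIG_ENDIAN)
#define LLVM_XXH_USE_NEON 1 // assumed condition: little-endian AArch64
#else
#define LLVM_XXH_USE_NEON 0
#endif
#endif

// Hypothetical helpers that report which code path was selected.
constexpr bool xxhUsesNeon() { return LLVM_XXH_USE_NEON != 0; }

const char *xxhBackendName() { return xxhUsesNeon() ? "neon" : "scalar"; }
```

Under these assumptions, compiling the file with `-DLLVM_XXH_USE_NEON=0`
selects the scalar path even on AArch64.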

XXH3 is used for ICF, string deduplication and computing the UUID in
ld64.lld; this commit reduces the time of a `--threads=8` link of
Chromium.framework by 1.77% +/- 0.59%.
