[PATCH] D134982: [X86] Add support for "light" AVX
Ilya Tokar via Phabricator via llvm-commits
llvm-commits at lists.llvm.org
Mon Oct 10 17:38:20 PDT 2022
TokarIP added a comment.
In D134982#3828850 <https://reviews.llvm.org/D134982#3828850>, @craig.topper wrote:
> I don’t think -mprefer-vector-width=128 has an effect on most instructions. The 256-bit version was heavily integrated into type legalization to split operations. I don’t think that was ever done for 128.
-mprefer-vector-width affects the vectorizer's decision when choosing the vector length (VL). Here is a motivating example: https://godbolt.org/z/j8hrP5jhb
In D134982#3829073 <https://reviews.llvm.org/D134982#3829073>, @RKSimon wrote:
> Do you have any more statistics on the range of machines and test cases you've tried this on (compared to -mattr=prefer-128-bit/256-bit)?
Tested (128 + this vs plain 128) on AMD Rome:
BM_Memcpy/0/0 [__llvm_libc::memcpy,memcpy Google A ] 19.2GB/s ± 3% 21.6GB/s ± 8% +12.44% (p=0.000 n=19+20)
BM_Memcpy/1/0 [__llvm_libc::memcpy,memcpy Google B ] 9.48GB/s ±11% 9.70GB/s ±10% ~ (p=0.228 n=18+20)
BM_Memcpy/2/0 [__llvm_libc::memcpy,memcpy Google D ] 33.0GB/s ± 2% 45.3GB/s ± 3% +37.08% (p=0.000 n=20+20)
BM_Memcpy/3/0 [__llvm_libc::memcpy,memcpy Google L ] 5.90GB/s ±17% 5.96GB/s ±19% ~ (p=0.835 n=19+20)
BM_Memcpy/4/0 [__llvm_libc::memcpy,memcpy Google M ] 6.55GB/s ±14% 6.87GB/s ±11% ~ (p=0.056 n=20+20)
BM_Memcpy/5/0 [__llvm_libc::memcpy,memcpy Google Q ] 3.74GB/s ±18% 3.55GB/s ±17% ~ (p=0.081 n=20+20)
BM_Memcpy/6/0 [__llvm_libc::memcpy,memcpy Google S ] 8.74GB/s ± 8% 9.16GB/s ± 7% +4.70% (p=0.002 n=18+20)
BM_Memcpy/7/0 [__llvm_libc::memcpy,memcpy Google U ] 9.79GB/s ±12% 10.38GB/s ±14% +6.01% (p=0.010 n=20+20)
BM_Memcpy/8/0 [__llvm_libc::memcpy,memcpy Google W ] 6.91GB/s ± 9% 7.24GB/s ± 8% +4.75% (p=0.001 n=19+20)
BM_Memcpy/9/0 [__llvm_libc::memcpy,uniform 384 to 4096 ] 43.2GB/s ± 1% 65.2GB/s ± 1% +50.69% (p=0.000 n=20+19)
Intel Skylake (server)
BM_Memcpy/0/0 [__llvm_libc::memcpy,memcpy Google A ] 18.1GB/s ± 9% 20.9GB/s ± 8% +15.58% (p=0.000 n=18+19)
BM_Memcpy/1/0 [__llvm_libc::memcpy,memcpy Google B ] 8.43GB/s ±14% 8.74GB/s ±18% ~ (p=0.175 n=19+20)
BM_Memcpy/2/0 [__llvm_libc::memcpy,memcpy Google D ] 34.5GB/s ± 3% 49.2GB/s ± 5% +42.88% (p=0.000 n=17+18)
BM_Memcpy/3/0 [__llvm_libc::memcpy,memcpy Google L ] 5.51GB/s ±29% 5.72GB/s ±19% ~ (p=0.461 n=20+19)
BM_Memcpy/4/0 [__llvm_libc::memcpy,memcpy Google M ] 5.57GB/s ±18% 5.72GB/s ±20% ~ (p=0.529 n=20+20)
BM_Memcpy/5/0 [__llvm_libc::memcpy,memcpy Google Q ] 2.97GB/s ±12% 3.15GB/s ±11% +6.08% (p=0.007 n=20+19)
BM_Memcpy/6/0 [__llvm_libc::memcpy,memcpy Google S ] 7.88GB/s ±15% 8.41GB/s ± 6% +6.68% (p=0.000 n=18+17)
BM_Memcpy/7/0 [__llvm_libc::memcpy,memcpy Google U ] 8.65GB/s ±19% 9.65GB/s ±17% +11.62% (p=0.001 n=20+20)
BM_Memcpy/8/0 [__llvm_libc::memcpy,memcpy Google W ] 6.17GB/s ±15% 6.41GB/s ±10% +3.75% (p=0.038 n=17+18)
BM_Memcpy/9/0 [__llvm_libc::memcpy,uniform 384 to 4096 ] 44.5GB/s ± 2% 70.0GB/s ± 9% +57.38% (p=0.000 n=16+17)
And Intel Haswell
BM_Memcpy/0/0 [__llvm_libc::memcpy,memcpy Google A ] 19.6GB/s ± 7% 22.5GB/s ± 8% +15.08% (p=0.000 n=20+20)
BM_Memcpy/1/0 [__llvm_libc::memcpy,memcpy Google B ] 9.15GB/s ± 5% 9.16GB/s ±13% ~ (p=0.798 n=17+20)
BM_Memcpy/2/0 [__llvm_libc::memcpy,memcpy Google D ] 37.4GB/s ± 6% 53.5GB/s ± 6% +42.95% (p=0.000 n=20+20)
BM_Memcpy/3/0 [__llvm_libc::memcpy,memcpy Google L ] 6.74GB/s ±17% 6.88GB/s ±17% ~ (p=0.461 n=20+19)
BM_Memcpy/4/0 [__llvm_libc::memcpy,memcpy Google M ] 6.56GB/s ± 5% 6.85GB/s ±16% ~ (p=0.105 n=18+20)
BM_Memcpy/5/0 [__llvm_libc::memcpy,memcpy Google Q ] 3.82GB/s ±18% 3.68GB/s ±24% ~ (p=0.253 n=20+20)
BM_Memcpy/6/0 [__llvm_libc::memcpy,memcpy Google S ] 8.75GB/s ± 9% 9.00GB/s ±14% ~ (p=0.211 n=20+20)
BM_Memcpy/7/0 [__llvm_libc::memcpy,memcpy Google U ] 10.2GB/s ±16% 10.6GB/s ±16% ~ (p=0.157 n=20+20)
BM_Memcpy/8/0 [__llvm_libc::memcpy,memcpy Google W ] 7.30GB/s ± 8% 7.42GB/s ±11% ~ (p=0.301 n=20+20)
BM_Memcpy/9/0 [__llvm_libc::memcpy,uniform 384 to 4096 ] 47.9GB/s ± 3% 77.3GB/s ± 6% +61.61% (p=0.000 n=19+20)
Internal loadtests show a 0.1-0.2% win vs -mprefer-vector-width=128. -mprefer-vector-width=256 causes several-percent performance regressions vs both this and plain 128.
================
Comment at: llvm/lib/Target/X86/X86ISelLowering.cpp:2665
if (Op.size() >= 32 && Subtarget.hasAVX() &&
- (Subtarget.getPreferVectorWidth() >= 256)) {
+ (Subtarget.getPreferVectorWidth() >= 256 || EnableLightAVX)) {
// Although this isn't a well-supported type for AVX1, we'll let
----------------
pengfei wrote:
> Here the check for 256 was introduced in rG47272217, authored by @echristo.
> It looks to me like it is the only difference between `prefer-128-bit` and `prefer-256-bit`. So I don't understand why you use `-mattr=prefer-128-bit -x86-light-avx=true` rather than `prefer-256-bit`.
I'm not sure I understand the question. Building everything with prefer-256-bit means getting e.g. 256-bit FMA and the corresponding frequency penalty. I want 256-bit loads/stores because they are a free performance win, but not the "heavy" instructions.
Repository:
rG LLVM Github Monorepo
CHANGES SINCE LAST ACTION
https://reviews.llvm.org/D134982/new/
https://reviews.llvm.org/D134982