[PATCH] D134982: [X86] Add support for "light" AVX
Ilya Tokar via Phabricator via llvm-commits
llvm-commits at lists.llvm.org
Mon Oct 10 17:38:20 PDT 2022
TokarIP added a comment.
In D134982#3828850 <https://reviews.llvm.org/D134982#3828850>, @craig.topper wrote:
> I don’t think -mprefer-vector-width=128 has an effect on most instructions. The 256-bit version was heavily integrated into type legalization to split operations. I don’t think that was ever done for 128.
-mprefer-vector-width affects the vectorizer's decision when choosing the vector length (VL). Here is a motivating example: https://godbolt.org/z/j8hrP5jhb
In D134982#3829073 <https://reviews.llvm.org/D134982#3829073>, @RKSimon wrote:
> Do you have any more statistics on the range of machines and test cases you've tried this on (compared to -mattr=prefer-128-bit/256-bit)?
Tested (128 + this vs plain 128) on AMD Rome:
BM_Memcpy/0/0 [__llvm_libc::memcpy,memcpy Google A ] 19.2GB/s ± 3% 21.6GB/s ± 8% +12.44% (p=0.000 n=19+20)
BM_Memcpy/1/0 [__llvm_libc::memcpy,memcpy Google B ] 9.48GB/s ±11% 9.70GB/s ±10% ~ (p=0.228 n=18+20)
BM_Memcpy/2/0 [__llvm_libc::memcpy,memcpy Google D ] 33.0GB/s ± 2% 45.3GB/s ± 3% +37.08% (p=0.000 n=20+20)
BM_Memcpy/3/0 [__llvm_libc::memcpy,memcpy Google L ] 5.90GB/s ±17% 5.96GB/s ±19% ~ (p=0.835 n=19+20)
BM_Memcpy/4/0 [__llvm_libc::memcpy,memcpy Google M ] 6.55GB/s ±14% 6.87GB/s ±11% ~ (p=0.056 n=20+20)
BM_Memcpy/5/0 [__llvm_libc::memcpy,memcpy Google Q ] 3.74GB/s ±18% 3.55GB/s ±17% ~ (p=0.081 n=20+20)
BM_Memcpy/6/0 [__llvm_libc::memcpy,memcpy Google S ] 8.74GB/s ± 8% 9.16GB/s ± 7% +4.70% (p=0.002 n=18+20)
BM_Memcpy/7/0 [__llvm_libc::memcpy,memcpy Google U ] 9.79GB/s ±12% 10.38GB/s ±14% +6.01% (p=0.010 n=20+20)
BM_Memcpy/8/0 [__llvm_libc::memcpy,memcpy Google W ] 6.91GB/s ± 9% 7.24GB/s ± 8% +4.75% (p=0.001 n=19+20)
BM_Memcpy/9/0 [__llvm_libc::memcpy,uniform 384 to 4096 ] 43.2GB/s ± 1% 65.2GB/s ± 1% +50.69% (p=0.000 n=20+19)
Intel Skylake (server)
BM_Memcpy/0/0 [__llvm_libc::memcpy,memcpy Google A ] 18.1GB/s ± 9% 20.9GB/s ± 8% +15.58% (p=0.000 n=18+19)
BM_Memcpy/1/0 [__llvm_libc::memcpy,memcpy Google B ] 8.43GB/s ±14% 8.74GB/s ±18% ~ (p=0.175 n=19+20)
BM_Memcpy/2/0 [__llvm_libc::memcpy,memcpy Google D ] 34.5GB/s ± 3% 49.2GB/s ± 5% +42.88% (p=0.000 n=17+18)
BM_Memcpy/3/0 [__llvm_libc::memcpy,memcpy Google L ] 5.51GB/s ±29% 5.72GB/s ±19% ~ (p=0.461 n=20+19)
BM_Memcpy/4/0 [__llvm_libc::memcpy,memcpy Google M ] 5.57GB/s ±18% 5.72GB/s ±20% ~ (p=0.529 n=20+20)
BM_Memcpy/5/0 [__llvm_libc::memcpy,memcpy Google Q ] 2.97GB/s ±12% 3.15GB/s ±11% +6.08% (p=0.007 n=20+19)
BM_Memcpy/6/0 [__llvm_libc::memcpy,memcpy Google S ] 7.88GB/s ±15% 8.41GB/s ± 6% +6.68% (p=0.000 n=18+17)
BM_Memcpy/7/0 [__llvm_libc::memcpy,memcpy Google U ] 8.65GB/s ±19% 9.65GB/s ±17% +11.62% (p=0.001 n=20+20)
BM_Memcpy/8/0 [__llvm_libc::memcpy,memcpy Google W ] 6.17GB/s ±15% 6.41GB/s ±10% +3.75% (p=0.038 n=17+18)
BM_Memcpy/9/0 [__llvm_libc::memcpy,uniform 384 to 4096 ] 44.5GB/s ± 2% 70.0GB/s ± 9% +57.38% (p=0.000 n=16+17)
And Intel Haswell
BM_Memcpy/0/0 [__llvm_libc::memcpy,memcpy Google A ] 19.6GB/s ± 7% 22.5GB/s ± 8% +15.08% (p=0.000 n=20+20)
BM_Memcpy/1/0 [__llvm_libc::memcpy,memcpy Google B ] 9.15GB/s ± 5% 9.16GB/s ±13% ~ (p=0.798 n=17+20)
BM_Memcpy/2/0 [__llvm_libc::memcpy,memcpy Google D ] 37.4GB/s ± 6% 53.5GB/s ± 6% +42.95% (p=0.000 n=20+20)
BM_Memcpy/3/0 [__llvm_libc::memcpy,memcpy Google L ] 6.74GB/s ±17% 6.88GB/s ±17% ~ (p=0.461 n=20+19)
BM_Memcpy/4/0 [__llvm_libc::memcpy,memcpy Google M ] 6.56GB/s ± 5% 6.85GB/s ±16% ~ (p=0.105 n=18+20)
BM_Memcpy/5/0 [__llvm_libc::memcpy,memcpy Google Q ] 3.82GB/s ±18% 3.68GB/s ±24% ~ (p=0.253 n=20+20)
BM_Memcpy/6/0 [__llvm_libc::memcpy,memcpy Google S ] 8.75GB/s ± 9% 9.00GB/s ±14% ~ (p=0.211 n=20+20)
BM_Memcpy/7/0 [__llvm_libc::memcpy,memcpy Google U ] 10.2GB/s ±16% 10.6GB/s ±16% ~ (p=0.157 n=20+20)
BM_Memcpy/8/0 [__llvm_libc::memcpy,memcpy Google W ] 7.30GB/s ± 8% 7.42GB/s ±11% ~ (p=0.301 n=20+20)
BM_Memcpy/9/0 [__llvm_libc::memcpy,uniform 384 to 4096 ] 47.9GB/s ± 3% 77.3GB/s ± 6% +61.61% (p=0.000 n=19+20)
Internal loadtests show a 0.1-0.2% win vs -mprefer-vector-width=128. -mprefer-vector-width=256 causes several-percent performance regressions vs both this and plain 128.
================
Comment at: llvm/lib/Target/X86/X86ISelLowering.cpp:2665
if (Op.size() >= 32 && Subtarget.hasAVX() &&
- (Subtarget.getPreferVectorWidth() >= 256)) {
+ (Subtarget.getPreferVectorWidth() >= 256 || EnableLightAVX)) {
// Although this isn't a well-supported type for AVX1, we'll let
----------------
pengfei wrote:
> Here the check for 256 was introduced in rG47272217, authored by @echristo.
> It looks to me like it is the only difference between `prefer-128-bit` and `prefer-256-bit`. So I don't understand why you use `-mattr=prefer-128-bit -x86-light-avx=true` rather than `prefer-256-bit`.
I'm not sure I understand the question. Building everything with prefer-256-bit means getting e.g. 256-bit FMA and the corresponding frequency penalty. I want 256-bit loads/stores because they are a free performance win, but not the "heavy" instructions.
Repository:
rG LLVM Github Monorepo
CHANGES SINCE LAST ACTION
https://reviews.llvm.org/D134982/new/
https://reviews.llvm.org/D134982