[PATCH] D134982: [X86] Add support for "light" AVX

Ilya Tokar via Phabricator via llvm-commits llvm-commits at lists.llvm.org
Fri Sep 30 12:00:44 PDT 2022


TokarIP created this revision.
TokarIP added a reviewer: craig.topper.
Herald added subscribers: StephenFan, pengfei, hiraditya.
Herald added a project: All.
TokarIP requested review of this revision.
Herald added a project: LLVM.
Herald added a subscriber: llvm-commits.

AVX/AVX512 instructions may cause a frequency drop on some CPUs, e.g. Skylake.
The magnitude of the frequency/performance drop depends on the instruction
type (multiplication vs. load/store) and the vector width. Currently, users
who want to avoid this drop can specify -mprefer-vector-width=128.
However, this also prevents generation of 256-bit wide instructions
that have no associated frequency drop (mainly loads/stores).

Add a flag that allows generation of 256-bit AVX loads/stores,
even when -mprefer-vector-width=128 is set, to speed up memcpy & co.
The flag is off by default, to avoid confusion when specifying
-mprefer-vector-width=128 and still seeing 256-bit instructions. Verified that
running a memcpy loop on all cores has no frequency impact and
zero CORE_POWER:LVL[12]_TURBO_LICENSE perf counters.

Makes copying memory faster:
BM_memcpy_aligned/256   80.7GB/s ± 3%   96.3GB/s ± 9%  +19.33%   (p=0.000 n=9+9)


Repository:
  rG LLVM Github Monorepo

https://reviews.llvm.org/D134982

Files:
  llvm/lib/Target/X86/X86ISelLowering.cpp
  llvm/test/CodeGen/X86/memcpy-light-avx.ll


Index: llvm/test/CodeGen/X86/memcpy-light-avx.ll
===================================================================
--- /dev/null
+++ llvm/test/CodeGen/X86/memcpy-light-avx.ll
@@ -0,0 +1,14 @@
+; RUN: llc < %s -mtriple=x86_64-unknown-linux-gnu -mcpu=haswell -mattr=prefer-128-bit -x86-light-avx=true | FileCheck %s
+
+declare void @llvm.memcpy.p0.p0.i64(ptr nocapture, ptr nocapture, i64, i1) nounwind
+
+define void @test1(ptr %a, ptr %b) nounwind {
+; CHECK-LABEL: test1:
+; CHECK:       # %bb.0:
+; CHECK-NEXT:    vmovups (%rsi), %ymm0
+; CHECK-NEXT:    vmovups %ymm0, (%rdi)
+; CHECK-NEXT:    vzeroupper
+; CHECK-NEXT:    retq
+  tail call void @llvm.memcpy.p0.p0.i64(ptr %a, ptr %b, i64 32, i1 false)
+  ret void
+}
Index: llvm/lib/Target/X86/X86ISelLowering.cpp
===================================================================
--- llvm/lib/Target/X86/X86ISelLowering.cpp
+++ llvm/lib/Target/X86/X86ISelLowering.cpp
@@ -93,6 +93,11 @@
              "stores respectively."),
     cl::Hidden);
 
+static cl::opt<bool>
+    EnableLightAVX("x86-light-avx", cl::init(false),
+                   cl::desc("Enable generation of 256-bit AVX load/stores, "
+                            "even with -mprefer-vector-width=128"));
+
 /// Call this when the user attempts to do something unsupported, like
 /// returning a double without SSE2 enabled on x86_64. This is not fatal, unlike
 /// report_fatal_error, so calling code should attempt to recover without
@@ -2657,7 +2662,7 @@
       }
       // FIXME: Check if unaligned 32-byte accesses are slow.
       if (Op.size() >= 32 && Subtarget.hasAVX() &&
-          (Subtarget.getPreferVectorWidth() >= 256)) {
+          (Subtarget.getPreferVectorWidth() >= 256 || EnableLightAVX)) {
         // Although this isn't a well-supported type for AVX1, we'll let
         // legalization and shuffle lowering produce the optimal codegen. If we
         // choose an optimal type with a vector element larger than a byte,

