[PATCH] D96522: [LV] Try larger VFs if VF is unprofitable for small types.

Thu Feb 11 09:26:07 PST 2021

fhahn created this revision.
fhahn added reviewers: Ayal, hfinkel, anemet, dmgreen.
Herald added subscribers: hiraditya, kristof.beyls.
fhahn requested review of this revision.
Herald added a project: LLVM.

Currently LV can choose sub-optimal vectorization factors for loops with
memory accesses using different widths. At the moment, the largest type
limits the vectorization factor, but this is overly pessimistic on some
targets, which have memory instructions that require a certain minimum
VF for operations on narrow types.

The motivating example is AArch64, which requires larger VFs for
vectorization to be profitable when narrow types are involved.

Currently, code like the example below is not vectorized on AArch64,
because the chosen max VF of 4 (picked because the largest type is i32)
is not profitable due to type extensions.

  int foo(unsigned char *len, unsigned size) {
    int maxLen = 0;
    int minLen = 0;
    for (unsigned i = 0; i < size; i++) {
      if (len[i] > maxLen) maxLen = len[i];
      if (len[i] < minLen) minLen = len[i];
    }
    return maxLen + minLen;
  }
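
To make the VF arithmetic for this loop concrete, here is a minimal
sketch. It assumes 128-bit vector registers (as with AArch64 NEON); the
helpers `powerOf2Floor` and `lanesPerRegister` are illustrative names
defined here, not LLVM APIs. With these assumptions the widest type
(i32) caps the VF at 4, while the i8 elements on their own would allow
16 lanes per register.

```cpp
#include <cstdint>

// Illustrative helper (not an LLVM API): largest power of two <= x.
// Returns 1 for x == 0 or x == 1.
constexpr uint64_t powerOf2Floor(uint64_t x) {
  uint64_t p = 1;
  while (p <= x / 2)
    p *= 2;
  return p;
}

// Illustrative helper: how many elements of `elementBits` bits fit into a
// vector register of `registerBits` bits, rounded down to a power of two.
constexpr uint64_t lanesPerRegister(uint64_t registerBits,
                                    uint64_t elementBits) {
  return powerOf2Floor(registerBits / elementBits);
}
```

Under the 128-bit assumption, `lanesPerRegister(128, 32)` gives the
default max VF of 4, and `lanesPerRegister(128, 8)` gives the 16 lanes
available to the narrow i8 loads.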

This patch addresses this issue by detecting cases where memory ops for
the narrowest type are more expensive at the chosen VF than with larger
VFs. For such cases, it instead considers larger vectorization factors,
limited by estimated register usage. Loops like the above can be sped up
by ~4x on AArch64.
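
The detection step can be sketched outside of LLVM as follows. This is a
simplified illustration, not the patch itself: `LoadCostFn` stands in
for the target's memory-op cost query, `toyLoadCost` is a made-up cost
model, and the default VF is assumed to be the register width divided by
the widest type (the real code uses the already-computed MaxVectorSize
and rounds to a power of two).

```cpp
#include <functional>

// Hypothetical stand-in for a target cost query: cost of loading a vector
// with `lanes` elements of `elementBits` bits each.
using LoadCostFn =
    std::function<unsigned(unsigned elementBits, unsigned lanes)>;

// Returns true if a load of the narrowest type is more expensive at the
// default VF (limited by the widest type) than at the largest VF the
// narrowest type would allow on its own.
bool narrowMemOpUnprofitable(unsigned smallestBits, unsigned widestBits,
                             unsigned widestRegisterBits,
                             const LoadCostFn &cost) {
  // Only consider narrow types that are strictly smaller than the widest.
  if (smallestBits > 32 || smallestBits >= widestBits)
    return false;
  unsigned defaultVF = widestRegisterBits / widestBits;
  unsigned maxPossibleVF = widestRegisterBits / smallestBits;
  return cost(smallestBits, defaultVF) > cost(smallestBits, maxPossibleVF);
}

// Made-up cost model for illustration only: vectors narrower than 64 bits
// are charged per lane, everything else costs 1 per 128-bit chunk.
unsigned toyLoadCost(unsigned elementBits, unsigned lanes) {
  unsigned bits = elementBits * lanes;
  if (bits < 64)
    return 2 * lanes;
  return (bits + 127) / 128;
}
```

With this toy model and 128-bit registers, an i8/i32 loop is flagged
(a 4 x i8 load is charged more than a 16 x i8 load), whereas an i32/i64
loop is not. When the flag is set, the vectorizer only adds larger VFs
to the candidate set; the cost model still makes the final choice.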

This change should not introduce regressions; we only explore more
vectorization factors, but the cost model still picks the most
profitable one.

The impact on SPEC2000 & SPEC2006 is relatively small:

  Tests: 31
  Same hash: 18 (filtered out)
  Remaining: 13
  Metric: loop-vectorize.LoopsVectorized
  
  test-suite...T2000/300.twolf/300.twolf.test 18.00 23.00 27.8%
  test-suite...T2000/256.bzip2/256.bzip2.test 12.00 14.00 16.7%
  test-suite...T2006/401.bzip2/401.bzip2.test 15.00 17.00 13.3%
  test-suite...T2006/445.gobmk/445.gobmk.test 25.00 27.00 8.0%
  test-suite...0/253.perlbmk/253.perlbmk.test 32.00 34.00 6.2%
  test-suite...000/186.crafty/186.crafty.test 19.00 20.00 5.3%
  test-suite...0.perlbench/400.perlbench.test 38.00 40.00 5.3%
  test-suite...T2006/456.hmmer/456.hmmer.test 63.00 65.00 3.2%
  test-suite...6/482.sphinx3/482.sphinx3.test 64.00 66.00 3.1%
  test-suite.../CINT2000/176.gcc/176.gcc.test 43.00 44.00 2.3%
  test-suite.../CINT2006/403.gcc/403.gcc.test 97.00 98.00 1.0%
  test-suite...3.xalancbmk/483.xalancbmk.test 271.00 273.00 0.7%
  test-suite...6/464.h264ref/464.h264ref.test 79.00 79.00 0.0%

There are a few small runtime improvements.

I also verified the changes to the vectorized loops in 300.twolf, 401.bzip2
& 445.gobmk. All changed loops are loops that the patch targets.


Repository:
  rG LLVM Github Monorepo

https://reviews.llvm.org/D96522

Files:
  llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
  llvm/test/Transforms/LoopVectorize/AArch64/extend-vectorization-factor-for-unprofitable-memops.ll


Index: llvm/test/Transforms/LoopVectorize/AArch64/extend-vectorization-factor-for-unprofitable-memops.ll
===================================================================

--- llvm/test/Transforms/LoopVectorize/AArch64/extend-vectorization-factor-for-unprofitable-memops.ll
+++ llvm/test/Transforms/LoopVectorize/AArch64/extend-vectorization-factor-for-unprofitable-memops.ll
@@ -9,7 +9,8 @@
 ; i8 memory accesses become profitable.
 define void @test_load_i8_store_i32(i8* noalias %src, i32* noalias %dst, i32 %off, i64 %N) {
 ; CHECK-LABEL: @test_load_i8_store_i32(
-; CHECK-NOT: x i8>
+; CHECK: <16 x i8>
+; CHECK: <16 x i32>
 ;
 entry:
   br label %loop
@@ -33,7 +34,8 @@
 ; Same as test_load_i8_store_i32, but with types flipped for load and store.
 define void @test_load_i32_store_i8(i32* noalias %src, i8* noalias %dst, i32 %off, i64 %N) {
 ; CHECK-LABEL: @test_load_i32_store_i8(
-; CHECK:     <4 x i8>
+; CHECK:     <16 x i32>
+; CHECK:     <16 x i8>
 ;
 entry:
   br label %loop
@@ -85,7 +87,8 @@
 ; vectorization factor.
 define void @test_load_i8_store_i64_large(i8* noalias %src, i64* noalias %dst, i64* noalias %dst.2, i64* noalias %dst.3, i64* noalias %dst.4, i64* noalias %dst.5, i64%off, i64 %off.2, i64 %N) {
 ; CHECK-LABEL: @test_load_i8_store_i64_large
-; CHECK: <2 x i64>
+; CHECK: <8 x i8>
+; CHECK: <8 x i64>
 ;
 entry:
   br label %loop
Index: llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
===================================================================
--- llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
+++ llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
@@ -5783,9 +5783,29 @@
     return ElementCount::getFixed(ConstTripCount);
   }
 
+  LLVMContext &Context = TheLoop->getHeader()->getContext();
+  // The largest type limits the vectorization factor, but this can be too
+  // limiting when smaller memory operations are present, which are not
+  // legal/profitable with the chosen vectorization factor and are only
+  // profitable with larger vectorization factors.
+  //
+  // Try to detect such cases and try increasing the VF in those cases.
+  bool NarrowMemOpUnprofitable = false;
+  if (SmallestType <= 32 && SmallestType < WidestType &&
+      !MaxVectorSize.isScalable()) {
+    Type *SmallVT = FixedVectorType::get(
+        IntegerType::get(Context, SmallestType), MaxVectorSize.getFixedValue());
+    Type *SmallMaxPossibleVT =
+        FixedVectorType::get(IntegerType::get(Context, SmallestType),
+                             PowerOf2Floor(WidestRegister / SmallestType));
+    NarrowMemOpUnprofitable =
+        TTI.getMemoryOpCost(Instruction::Load, SmallVT, Align(1), 0) >
+        TTI.getMemoryOpCost(Instruction::Load, SmallMaxPossibleVT, Align(1), 0);
+  }
   ElementCount MaxVF = MaxVectorSize;
   if (TTI.shouldMaximizeVectorBandwidth(!isScalarEpilogueAllowed()) ||
-      (MaximizeBandwidth && isScalarEpilogueAllowed())) {
+      ((MaximizeBandwidth || NarrowMemOpUnprofitable) &&
+       isScalarEpilogueAllowed())) {
     // Collect all viable vectorization factors larger than the default MaxVF
     // (i.e. MaxVectorSize).
     SmallVector<ElementCount, 8> VFs;

