[llvm] ae08362 - [X86][Costmodel] Load/store i8 Stride=4 VF=4 interleaving costs

Sat Oct 2 03:52:40 PDT 2021

Author: Roman Lebedev
Date: 2021-10-02T13:40:20+03:00
New Revision: ae08362cb8e60864a0505af47189d6a996cfb5d9

URL: https://github.com/llvm/llvm-project/commit/ae08362cb8e60864a0505af47189d6a996cfb5d9
DIFF: https://github.com/llvm/llvm-project/commit/ae08362cb8e60864a0505af47189d6a996cfb5d9.diff

LOG: [X86][Costmodel] Load/store i8 Stride=4 VF=4 interleaving costs

While we already model this tuple, the store cost is divergent from reality, so fix it.

The only sched models that for cpu's that support avx2
but not avx512 are: haswell, broadwell, skylake, zen1-3

For load we have:
https://godbolt.org/z/1n4bPh7Tn - for intels `Block RThroughput: =4.0`; for ryzens, `Block RThroughput: <=2.0`
So pick cost of `4`.

For store we have:
https://godbolt.org/z/r8K9sveqo - for intels `Block RThroughput: =4.0`; for ryzens, `Block RThroughput: <=2.0`
So pick cost of `4`.

I'm directly using the shuffling asm the llc produced,
without any manual fixups that may be needed
to ensure sequential execution.

Reviewed By: RKSimon

Differential Revision: https://reviews.llvm.org/D110968

Added: 
    

Modified: 
    llvm/lib/Target/X86/X86TargetTransformInfo.cpp
    llvm/test/Analysis/CostModel/X86/interleaved-store-i8-stride-4.ll

Removed: 
    


################################################################################
diff  --git a/llvm/lib/Target/X86/X86TargetTransformInfo.cpp b/llvm/lib/Target/X86/X86TargetTransformInfo.cpp
index 992a616f3b54..6e3aaba56222 100644

--- a/llvm/lib/Target/X86/X86TargetTransformInfo.cpp
+++ b/llvm/lib/Target/X86/X86TargetTransformInfo.cpp
@@ -5145,7 +5145,7 @@ InstructionCost X86TTIImpl::getInterleavedMemoryOpCostAVX2(
       {3, MVT::v32i8, 13}, // interleave 3 x 32i8 into 96i8 (and store)
 
       {4, MVT::v2i8, 4},  // interleave 4 x 2i8 into 8i8 (and store)
-      {4, MVT::v4i8, 9},   // interleave 4 x 4i8 into 16i8 (and store)
+      {4, MVT::v4i8, 4},   // interleave 4 x 4i8 into 16i8 (and store)
       {4, MVT::v8i8, 10},  // interleave 4 x 8i8 into 32i8 (and store)
       {4, MVT::v16i8, 10}, // interleave 4 x 16i8 into 64i8 (and store)
       {4, MVT::v32i8, 12}, // interleave 4 x 32i8 into 128i8 (and store)

diff  --git a/llvm/test/Analysis/CostModel/X86/interleaved-store-i8-stride-4.ll b/llvm/test/Analysis/CostModel/X86/interleaved-store-i8-stride-4.ll
index 48da565a0a0e..5b2e7fae6cf5 100644
--- a/llvm/test/Analysis/CostModel/X86/interleaved-store-i8-stride-4.ll
+++ b/llvm/test/Analysis/CostModel/X86/interleaved-store-i8-stride-4.ll
@@ -27,7 +27,7 @@ target triple = "x86_64-unknown-linux-gnu"
 ;
 ; AVX2: LV: Found an estimated cost of 1 for VF 1 For instruction:   store i8 %v3, i8* %out3, align 1
 ; AVX2: LV: Found an estimated cost of 5 for VF 2 For instruction:   store i8 %v3, i8* %out3, align 1
-; AVX2: LV: Found an estimated cost of 10 for VF 4 For instruction:   store i8 %v3, i8* %out3, align 1
+; AVX2: LV: Found an estimated cost of 5 for VF 4 For instruction:   store i8 %v3, i8* %out3, align 1
 ; AVX2: LV: Found an estimated cost of 11 for VF 8 For instruction:   store i8 %v3, i8* %out3, align 1
 ; AVX2: LV: Found an estimated cost of 12 for VF 16 For instruction:   store i8 %v3, i8* %out3, align 1
 ; AVX2: LV: Found an estimated cost of 16 for VF 32 For instruction:   store i8 %v3, i8* %out3, align 1