[PATCH] D101460: [SLP]Try to vectorize tiny trees with shuffled gathers of extractelements.

Alexey Bataev via Phabricator via llvm-commits llvm-commits at lists.llvm.org
Thu Apr 29 05:33:51 PDT 2021


ABataev added inline comments.


================
Comment at: llvm/test/Transforms/SLPVectorizer/AArch64/ext-trunc.ll:21
 ; CHECK-NEXT:    [[E2:%.*]] = extractelement <4 x i32> [[SUB0]], i32 2
-; CHECK-NEXT:    [[S2:%.*]] = sext i32 [[E2]] to i64
-; CHECK-NEXT:    [[GEP2:%.*]] = getelementptr inbounds i64, i64* [[P]], i64 [[S2]]
+; CHECK-NEXT:    [[TMP0:%.*]] = insertelement <2 x i32> poison, i32 [[E1]], i32 0
+; CHECK-NEXT:    [[TMP1:%.*]] = insertelement <2 x i32> [[TMP0]], i32 [[E2]], i32 1
----------------
david-arm wrote:
> At first glance this looks worse, but I've tried out your patch and can see that the generated code is the same, because the entire first sequence of inserts, sext and trunc gets folded away, since the sext + trunc is basically a no-op.
Yeah, llvm-mca reports a throughput of 13.5 for the non-vectorized code and 15.5 with the vectorized call (the difference is smaller for newer processors). This looks like another example of the known problem of overly optimistic user-cost compensation. It should go away once we land the proper implementation of insertelement instruction vectorization, but in the meantime I'll try to prepare a temporary patch to improve the situation.
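
For reference, a minimal IR sketch of the folding David describes (hypothetical function and value names, not taken from the ext-trunc.ll test itself): the repacked <2 x i32> lanes are sign-extended and then a lane is truncated back to i32, so instcombine collapses the whole insert/sext/trunc chain back to the original scalar.

  ; Hypothetical example, not the actual test IR.
  define i32 @sext_trunc_noop(i32 %e1, i32 %e2) {
    %v0 = insertelement <2 x i32> poison, i32 %e1, i32 0
    %v1 = insertelement <2 x i32> %v0, i32 %e2, i32 1
    ; Widen the lanes to i64 ...
    %ext = sext <2 x i32> %v1 to <2 x i64>
    ; ... then immediately narrow one lane back, so the widening is redundant.
    %lane = extractelement <2 x i64> %ext, i32 1
    %res = trunc i64 %lane to i32   ; folds to plain %e2
    ret i32 %res
  }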


Repository:
  rG LLVM Github Monorepo

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D101460/new/

https://reviews.llvm.org/D101460


