[PATCH] D57779: [SLP] Add support for throttling.

Sun Feb 10 16:17:25 PST 2019

dtemirbulatov marked 9 inline comments as done.
dtemirbulatov added inline comments.

================
Comment at: lib/Transforms/Vectorize/SLPVectorizer.cpp:2436
+    TreeEntry *Entry = &VectorizableTree[I];
+    if (!Entry->NeedToGather) {
+      NumVectorized++;
----------------
vporpo wrote:
> Shouldn't we check Entry->Cost > 0 instead ? That would also take into consideration the spill costs.
ok, WIP.

================
Comment at: lib/Transforms/Vectorize/SLPVectorizer.cpp:2438
+      NumVectorized++;
+      Sum += Entry->Cost;
+      Tree.push_back(Sum);
----------------
vporpo wrote:
> Not sure what is going on here. This line is the same in the else part (line 2442).  Move it out of the if ?
well, those lines were incorrect.

================
Comment at: lib/Transforms/Vectorize/SLPVectorizer.cpp:4855
       Changed = true;
+    } else {
+      // Try to reduce the tree to make it patrially vectorizable.
----------------
vporpo wrote:
> We should try to apply throttling even if it is profitable to vectorize.
> For example, the SLP graph could have two branches: one super profitable, say with cost -10, and another one with cost +8. The total is -2, which is still profitable, but the best solution would be to remove the bad branch.
> 
> In that case the call to reduceTreeCost() should be moved to buildTree().
ok, Let see what we will get with the proposed change on the benchmarks side.

================
Comment at: test/Transforms/SLPVectorizer/X86/ctlz.ll:76
 ; CHECK-LABEL: @ctlz_4i32(
-; CHECK-NEXT:    [[LD0:%.*]] = load i32, i32* getelementptr inbounds ([8 x i32], [8 x i32]* @src32, i32 0, i64 0), align 4
-; CHECK-NEXT:    [[LD1:%.*]] = load i32, i32* getelementptr inbounds ([8 x i32], [8 x i32]* @src32, i32 0, i64 1), align 4
-; CHECK-NEXT:    [[LD2:%.*]] = load i32, i32* getelementptr inbounds ([8 x i32], [8 x i32]* @src32, i32 0, i64 2), align 4
-; CHECK-NEXT:    [[LD3:%.*]] = load i32, i32* getelementptr inbounds ([8 x i32], [8 x i32]* @src32, i32 0, i64 3), align 4
-; CHECK-NEXT:    [[CTLZ0:%.*]] = call i32 @llvm.ctlz.i32(i32 [[LD0]], i1 false)
-; CHECK-NEXT:    [[CTLZ1:%.*]] = call i32 @llvm.ctlz.i32(i32 [[LD1]], i1 false)
-; CHECK-NEXT:    [[CTLZ2:%.*]] = call i32 @llvm.ctlz.i32(i32 [[LD2]], i1 false)
-; CHECK-NEXT:    [[CTLZ3:%.*]] = call i32 @llvm.ctlz.i32(i32 [[LD3]], i1 false)
-; CHECK-NEXT:    store i32 [[CTLZ0]], i32* getelementptr inbounds ([8 x i32], [8 x i32]* @dst32, i32 0, i64 0), align 4
-; CHECK-NEXT:    store i32 [[CTLZ1]], i32* getelementptr inbounds ([8 x i32], [8 x i32]* @dst32, i32 0, i64 1), align 4
-; CHECK-NEXT:    store i32 [[CTLZ2]], i32* getelementptr inbounds ([8 x i32], [8 x i32]* @dst32, i32 0, i64 2), align 4
-; CHECK-NEXT:    store i32 [[CTLZ3]], i32* getelementptr inbounds ([8 x i32], [8 x i32]* @dst32, i32 0, i64 3), align 4
+; CHECK-NEXT:    [[TMP1:%.*]] = load <4 x i32>, <4 x i32>* bitcast ([8 x i32]* @src32 to <4 x i32>*), align 4
+; CHECK-NEXT:    [[TMP2:%.*]] = call <4 x i32> @llvm.ctlz.v4i32(<4 x i32> [[TMP1]], i1 false)
----------------
ABataev wrote:
> There is definitely something wrong with this patch. For example, in this case. There is no partial tree. Why is it vectorized?
Sorry for this, just cleaned the implementation.

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D57779/new/

https://reviews.llvm.org/D57779