[PATCH] D29088: Do not create ctlz/cttz(X, false) when the target do not support zero defined ctlz/cttz.
Andrea Di Biagio via Phabricator via llvm-commits
llvm-commits at lists.llvm.org
Thu Jan 26 05:45:47 PST 2017
andreadb added a comment.
Thanks Amaury,
Sorry in advance for this long post.
Your code example gave me some hints about what is going wrong.
I think I have been able to find a small reproducible for your particular scenario (see below).
static int my_clz(int n) {
return (n)? __builtin_clz(n) : 32;
}
int foo(int n) {
unsigned val;
if (n == -1)
return 32;
return my_clz(~n);
}
'my_clz' matches the definition of __lzcnt32 from 'lzcntintrin.h'. Before we run SimplifyCFG, that function would look like this:
define internal i32 @my_clz(i32 %n) {
entry:
%tobool = icmp ne i32 %n, 0
br i1 %tobool, label %cond.true, label %cond.end
cond.true: ; preds = %entry
%0 = call i32 @llvm.ctlz.i32(i32 %n, i1 true)
br label %cond.end
cond.end: ; preds = %entry, %cond.true
%cond = phi i32 [ %0, %cond.true ], [ 32, %entry ]
ret i32 %cond
}
SimplifyCFG would firstly speculate the (potentially expensive) call to llvm.ctlz, and then flatten the cfg by inserting a select:
define internal i32 @my_clz(i32 %n) {
entry:
%tobool = icmp eq i32 %n, 0
%0 = call i32 @llvm.ctlz.i32(i32 %n, i1 true)
%cond = select i1 %tobool, i32 32, i32 %0
ret i32 %cond
}
At this point, our "problematic" instcombine kicks in, and that entire code sequence is simplified into `@llvm.ctlz.i32(i32 %n, i1 false)`.
I can see how this is going to be sub-optimal if we are in the following scenario:
1. We are building for a non-LZCNT target (example: SandyBridge), and
2. Our instcombine is triggered before `my_clz` is inlined into `foo`.
Before CodeGenPrepare, we would end up with code like this:
define i32 @_Z3fooi(i32 %n) local_unnamed_addr #0 {
entry:
%cmp = icmp eq i32 %n, -1
br i1 %cmp, label %cleanup, label %if.end
if.end: ; preds = %entry
%neg = xor i32 %n, -1
%0 = tail call i32 @llvm.ctlz.i32(i32 %neg, i1 false) #2 ;; <--- suboptimal flag!.
br label %cleanup
cleanup: ; preds = %entry, %if.end
%retval.0 = phi i32 [ %0, %if.end ], [ 32, %entry ]
ret i32 %retval.0
}
Basically, I can see how triggering that instcombine too prematurely might lead to poor codegen for your non-lzcnt target.
Ideally, we would want that canonicalization to be performed directly before (or during) codegen to help instruction selection on targets that have a fast ctz/clz defined on zero.
What if instead we move that transform into `CodeGenPrepare::optimizeSelectInst()`? I think that would fix the issue in a cleaner way, since we would not need to introduce new target hooks to conditionalize its execution.
-Andrea
https://reviews.llvm.org/D29088
More information about the llvm-commits
mailing list