[PATCH] D29088: Do not create ctlz/cttz(X, false) when the target do not support zero defined ctlz/cttz.

Thu Jan 26 05:45:47 PST 2017

andreadb added a comment.

Thanks Amaury,

Sorry in advance for this long post.

Your code example gave me some hints about what is going wrong.
I think I have been able to find a small reproducible for your particular scenario (see below).

  static int my_clz(int n) {
    return (n)? __builtin_clz(n) : 32;
  }

  int foo(int n) {
    unsigned val;
    if (n == -1)
      return 32;
    return my_clz(~n);
  }

'my_clz' matches the definition of __lzcnt32 from 'lzcntintrin.h'. Before we run SimplifyCFG, that function would look like this:

  define internal i32 @my_clz(i32 %n) {
  entry:
    %tobool = icmp ne i32 %n, 0
    br i1 %tobool, label %cond.true, label %cond.end

  cond.true:                                        ; preds = %entry
    %0 = call i32 @llvm.ctlz.i32(i32 %n, i1 true)
    br label %cond.end

  cond.end:                                         ; preds = %entry, %cond.true
    %cond = phi i32 [ %0, %cond.true ], [ 32, %entry ]
    ret i32 %cond
  }

SimplifyCFG would firstly speculate the (potentially expensive) call to llvm.ctlz, and then flatten the cfg by inserting a select:

  define internal i32 @my_clz(i32 %n) {
  entry:
    %tobool = icmp eq i32 %n, 0
    %0 = call i32 @llvm.ctlz.i32(i32 %n, i1 true)
    %cond = select i1 %tobool, i32 32, i32 %0
    ret i32 %cond
  }

At this point, our "problematic" instcombine kicks in, and that entire code sequence is simplified into `@llvm.ctlz.i32(i32 %n, i1 false)`.
I can see how this is going to be sub-optimal if we are in the following scenario:

1. We are building for a non-LZCNT target (example: SandyBridge), and
2. Our instcombine is triggered before `my_clz` is inlined into `foo`.

Before CodeGenPrepare, we would end up with code like this:

  define i32 @_Z3fooi(i32 %n) local_unnamed_addr #0 {
  entry:
    %cmp = icmp eq i32 %n, -1
    br i1 %cmp, label %cleanup, label %if.end

  if.end:                                           ; preds = %entry
    %neg = xor i32 %n, -1
    %0 = tail call i32 @llvm.ctlz.i32(i32 %neg, i1 false) #2   ;; <--- suboptimal flag!.
    br label %cleanup

  cleanup:                                          ; preds = %entry, %if.end
    %retval.0 = phi i32 [ %0, %if.end ], [ 32, %entry ]
    ret i32 %retval.0
  }

Basically, I can see how triggering that instcombine too prematurely might lead to poor codegen for your non-lzcnt target.

Ideally, we would want that canonicalization to be performed directly before (or during) codegen to help instruction selection on targets that have a fast ctz/clz defined on zero.

What if instead we move that transform into `CodeGenPrepare::optimizeSelectInst()`? I think that would fix the issue in a cleaner way, since we would not need to introduce new target hooks to conditionalize its execution.

-Andrea

https://reviews.llvm.org/D29088