[llvm-commits] [llvm] r123547 - /llvm/trunk/lib/CodeGen/SelectionDAG/LegalizeDAG.cpp
Chris Lattner
clattner at apple.com
Sat Jan 15 17:02:34 PST 2011
On Jan 15, 2011, at 12:30 PM, Benjamin Kramer wrote:
> Author: d0k
> Date: Sat Jan 15 14:30:30 2011
> New Revision: 123547
>
> URL: http://llvm.org/viewvc/llvm-project?rev=123547&view=rev
> Log:
> Reimplement CTPOP legalization with the "best" algorithm from
> http://graphics.stanford.edu/~seander/bithacks.html#CountBitsSetParallel
>
> In a silly microbenchmark on a 65 nm core2 this is 1.5x faster than the old
> code in 32 bit mode and about 2x faster in 64 bit mode. It's also a lot shorter,
> especially when counting 64 bit population on a 32 bit target.
>
> I hope this is fast enough to replace Kernighan-style counting loops even when
> the input is rather sparse.
Wow, very nice! This is quite a bit faster than the old code. When I replace the manual ctpop loops in crafty with __builtin_popcount, performance still slows down from 5.5s to 5.62s, though. This is very close to making me want to turn it on, particularly given that it is almost certainly a win for all targets with a popcount instruction and for 32-bit platforms without one.
One major advantage of recognizing popcount, though, is that the optimizer has a chance of hacking on it, and there are a lot of instcombine xforms that we could do. Here are some that I noted looking at the bc file for crafty (when hacked to use __builtin_popcount). I attached the bc file below if you're interested.
One interesting thing that I see is that the calls are often of the form:
icmp_ugt(ctpop(a), ctpop(b))
I wonder if there is some clever optimization for that case.
Another case I see in a few places is:
%331 = tail call i64 @llvm.ctpop.i64(i64 %330) nounwind
%cast.i137 = trunc i64 %331 to i32
%332 = icmp ugt i32 %cast.i137, 1
Where both the trunc and the ctpop have one use.
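One plausible instcombine-style rewrite for that shape (my sketch, not existing code) is to fold `popcount(x) > 1` into a test that clears the lowest set bit, sidestepping both the ctpop and the trunc:

```c
#include <stdint.h>

/* popcount(x) > 1 iff x still has a bit set after clearing its lowest
   set bit -- a candidate fold for the icmp ugt (ctpop x), 1 pattern. */
static int popcount_gt_one(uint64_t x) {
    return (x & (x - 1)) != 0;
}
```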
There are a few other interesting patterns that seem simplifiable:
; <label>:178 ; preds = %176
...
%183 = tail call i64 @llvm.ctpop.i64(i64 %182) nounwind
%cast.i107 = trunc i64 %183 to i32
%184 = getelementptr inbounds [64 x i64]* @w_pawn_attacks, i64 0, i64 %179
%185 = load i64* %184, align 8, !tbaa !2
%186 = and i64 %68, %185
%187 = tail call i64 @llvm.ctpop.i64(i64 %186) nounwind
%cast.i108 = trunc i64 %187 to i32
%188 = getelementptr inbounds [65 x i64]* @set_mask, i64 0, i64 %179
%189 = load i64* %188, align 8, !tbaa !2
%190 = and i64 %120, %189
%191 = icmp eq i64 %190, 0
br i1 %191, label %192, label %.thread207
; <label>:192 ; preds = %178
%193 = icmp ugt i32 %cast.i108, %cast.i107
br i1 %193, label %197, label %194
; <label>:194 ; preds = %192
%195 = icmp eq i32 %cast.i107, 0
%196 = icmp ult i32 %cast.i107, %cast.i108
%or.cond246 = or i1 %195, %196
%indvar.next433 = add i32 %indvar432, 1
br i1 %or.cond246, label %176, label %.thread218
; <label>:197 ; preds = %192
%198 = sub nsw i32 %cast.i108, %cast.i107
%199 = icmp eq i32 %cast.i108, %cast.i107
br i1 %199, label %.thread218, label %.thread207
"%195" seems like it is just "icmp ne i64 %182, 0"
%196 seems like it is the same thing as %193. I wonder if we should always canonicalize icmps to "lt" comparisons when both operands are non-constant. It seems that this would expose more CSEs.
Here's another case that might allow cleverness:
%437 = tail call i64 @llvm.ctpop.i64(i64 %436) nounwind
%cast.i138 = trunc i64 %437 to i32
...
%440 = tail call i64 @llvm.ctpop.i64(i64 %439) nounwind
%cast.i139 = trunc i64 %440 to i32
%441 = sub nsw i32 %cast.i139, %cast.i138
%442 = icmp eq i32 %441, 2
Here's another obvious case:
%592 = tail call i64 @llvm.ctpop.i64(i64 %591) nounwind
%cast.i176 = trunc i64 %592 to i32
%593 = icmp eq i32 %cast.i176, 0
In this case, %592 has multiple uses. It seems that we should be able to eliminate the trunc, though, since we know the top bits of the ctpop result are zero.
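The comparison itself also folds away entirely: `ctpop(x) == 0` holds exactly when `x == 0`, so no count is needed at all (a trivial sketch, function name mine):

```c
#include <stdint.h>

/* icmp eq (ctpop x), 0 reduces to icmp eq x, 0 -- no count needed. */
static int popcount_is_zero(uint64_t x) {
    return x == 0;
}
```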
Here's another interesting pattern:
%778 = tail call i64 @llvm.ctpop.i64(i64 %777) nounwind
%cast.i195 = trunc i64 %778 to i32
...
%781 = tail call i64 @llvm.ctpop.i64(i64 %780) nounwind
%cast.i196 = trunc i64 %781 to i32
%782 = sub nsw i32 %cast.i195, %cast.i196
%783 = icmp eq i32 %782, 2
br i1 %783, label %.loopexit.thread, label %784
; <label>:784 ; preds = %775
switch i32 %cast.i195, label %.thread245 [
i32 1, label %785
i32 0, label %786
]
; <label>:785 ; preds = %784
%.not9 = icmp ne i32 %cast.i196, 0
...
And:
%1037 = tail call i64 @llvm.ctpop.i64(i64 %1036) nounwind
%1038 = icmp eq i64 %1037, 3
These "popcount = 3" and "popcount < 2" sorts of cases seems that they could use a couple iterations of the unrolled "a &= a-1" checks or something, instead of computing the full computation.
For example, the top of GenerateCheckEvasions has "popcount(x) == 1", which could be done as "x != 0 && ((x & (x-1)) == 0)" more cheaply than expanding the popcount. This sort of thing is a bad idea if ctpop expands to a single-cycle instruction, though, so this is probably best to do in dag combine instead of instcombine.
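The unrolled "a &= a-1" idea can be sketched like this (the helper and its name are mine, for illustration): testing `popcount(a) == k` for a small constant k by clearing at most k low bits:

```c
#include <stdint.h>

/* Decide popcount(a) == k for a small constant k by clearing the lowest
   set bit k times; cheaper than a full count when k is tiny. */
static int popcount_eq_small(uint64_t a, unsigned k) {
    for (unsigned i = 0; i < k; ++i) {
        if (a == 0)
            return 0;          /* fewer than k bits were set */
        a &= a - 1;            /* clear the lowest set bit */
    }
    return a == 0;             /* exactly k iff nothing is left */
}
```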
Anyway, if you're interested in poking at it, here's the hacked bc file:
-------------- next part --------------
A non-text attachment was scrubbed...
Name: 186.crafty.llvm.bc
Type: application/octet-stream
Size: 401552 bytes
Desc: not available
URL: <http://lists.llvm.org/pipermail/llvm-commits/attachments/20110115/dc9747b7/attachment.obj>
-------------- next part --------------
-Chris