[llvm-commits] [llvm] r123547 - /llvm/trunk/lib/CodeGen/SelectionDAG/LegalizeDAG.cpp
Chris Lattner
clattner at apple.com
Sat Jan 15 17:02:34 PST 2011
On Jan 15, 2011, at 12:30 PM, Benjamin Kramer wrote:
> Author: d0k
> Date: Sat Jan 15 14:30:30 2011
> New Revision: 123547
>
> URL: http://llvm.org/viewvc/llvm-project?rev=123547&view=rev
> Log:
> Reimplement CTPOP legalization with the "best" algorithm from
> http://graphics.stanford.edu/~seander/bithacks.html#CountBitsSetParallel
>
> In a silly microbenchmark on a 65 nm core2 this is 1.5x faster than the old
> code in 32 bit mode and about 2x faster in 64 bit mode. It's also a lot shorter,
> especially when counting 64 bit population on a 32 bit target.
>
> I hope this is fast enough to replace Kernighan-style counting loops even when
> the input is rather sparse.
Wow, very nice! This is quite a bit faster than the old code. When I replace the manual ctpop loops in crafty with __builtin_popcount, performance still slows down from 5.5s to 5.62s, though. This is very close to making me want to turn it on, particularly given that it is almost certainly a win for all targets with a popcount instruction and for 32-bit platforms without one.
One major advantage of recognizing popcount, though, is that the optimizer has a chance of hacking on it, and there are a lot of instcombine xforms that we could do. Here are some that I noted looking at the bc file for crafty (when hacked to use __builtin_popcount). I attached the bc file below if you're interested.
One interesting thing that I see is that the calls are often of the form:
icmp_ugt(ctpop(a), ctpop(b))
I wonder if there is some clever optimization for that case.
Another case I see in a few places is:
%331 = tail call i64 @llvm.ctpop.i64(i64 %330) nounwind
%cast.i137 = trunc i64 %331 to i32
%332 = icmp ugt i32 %cast.i137, 1
Where both the trunc and the ctpop have one use.
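One plausible instcombine-style rewrite for that shape (my sketch, not existing code) is to fold `popcount(x) > 1` into a test that clears the lowest set bit, sidestepping both the ctpop and the trunc:

```c
#include <stdint.h>

/* popcount(x) > 1 iff x still has a bit set after clearing its lowest
   set bit -- a candidate fold for the icmp ugt (ctpop x), 1 pattern. */
static int popcount_gt_one(uint64_t x) {
    return (x & (x - 1)) != 0;
}
```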
There are a few other interesting patterns that seem simplifiable:
; <label>:178 ; preds = %176
...
%183 = tail call i64 @llvm.ctpop.i64(i64 %182) nounwind
%cast.i107 = trunc i64 %183 to i32
%184 = getelementptr inbounds [64 x i64]* @w_pawn_attacks, i64 0, i64 %179
%185 = load i64* %184, align 8, !tbaa !2
%186 = and i64 %68, %185
%187 = tail call i64 @llvm.ctpop.i64(i64 %186) nounwind
%cast.i108 = trunc i64 %187 to i32
%188 = getelementptr inbounds [65 x i64]* @set_mask, i64 0, i64 %179
%189 = load i64* %188, align 8, !tbaa !2
%190 = and i64 %120, %189
%191 = icmp eq i64 %190, 0
br i1 %191, label %192, label %.thread207
; <label>:192 ; preds = %178
%193 = icmp ugt i32 %cast.i108, %cast.i107
br i1 %193, label %197, label %194
; <label>:194 ; preds = %192
%195 = icmp eq i32 %cast.i107, 0
%196 = icmp ult i32 %cast.i107, %cast.i108
%or.cond246 = or i1 %195, %196
%indvar.next433 = add i32 %indvar432, 1
br i1 %or.cond246, label %176, label %.thread218
; <label>:197 ; preds = %192
%198 = sub nsw i32 %cast.i108, %cast.i107
%199 = icmp eq i32 %cast.i108, %cast.i107
br i1 %199, label %.thread218, label %.thread207
"%195" seems like it is just "icmp ne i64 %182, 0"
%196 seems like it is the same thing as %193. I wonder if we should always canonicalize icmps to "lt" comparisons when both operands are non-constant. It seems that this would expose more CSEs.
Here's another case that might allow cleverness:
%437 = tail call i64 @llvm.ctpop.i64(i64 %436) nounwind
%cast.i138 = trunc i64 %437 to i32
...
%440 = tail call i64 @llvm.ctpop.i64(i64 %439) nounwind
%cast.i139 = trunc i64 %440 to i32
%441 = sub nsw i32 %cast.i139, %cast.i138
%442 = icmp eq i32 %441, 2
Here's another obvious case:
%592 = tail call i64 @llvm.ctpop.i64(i64 %591) nounwind
%cast.i176 = trunc i64 %592 to i32
%593 = icmp eq i32 %cast.i176, 0
In this case, %592 has multiple uses. It seems that we should be able to eliminate the trunc, though, since we know the top bits of the ctpop result are zero.
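The comparison itself also folds away entirely: `ctpop(x) == 0` holds exactly when `x == 0`, so no count is needed at all (a trivial sketch, function name mine):

```c
#include <stdint.h>

/* icmp eq (ctpop x), 0 reduces to icmp eq x, 0 -- no count needed. */
static int popcount_is_zero(uint64_t x) {
    return x == 0;
}
```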
Here's another interesting pattern:
%778 = tail call i64 @llvm.ctpop.i64(i64 %777) nounwind
%cast.i195 = trunc i64 %778 to i32
...
%781 = tail call i64 @llvm.ctpop.i64(i64 %780) nounwind
%cast.i196 = trunc i64 %781 to i32
%782 = sub nsw i32 %cast.i195, %cast.i196
%783 = icmp eq i32 %782, 2
br i1 %783, label %.loopexit.thread, label %784
; <label>:784 ; preds = %775
switch i32 %cast.i195, label %.thread245 [
i32 1, label %785
i32 0, label %786
]
; <label>:785 ; preds = %784
%.not9 = icmp ne i32 %cast.i196, 0
...
And:
%1037 = tail call i64 @llvm.ctpop.i64(i64 %1036) nounwind
%1038 = icmp eq i64 %1037, 3
These "popcount = 3" and "popcount < 2" sorts of cases seems that they could use a couple iterations of the unrolled "a &= a-1" checks or something, instead of computing the full computation.
For example, the top of GenerateCheckEvasions has "popcount(x) == 1", which could be done as "x != 0 && ((x & (x-1)) == 0)" more cheaply than expanding the popcount. This sort of thing is a bad idea if ctpop expands to a single-cycle instruction, though, so this is probably best to do in dag combine instead of instcombine.
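The unrolled "a &= a-1" idea can be sketched like this (the helper and its name are mine, for illustration): testing `popcount(a) == k` for a small constant k by clearing at most k low bits:

```c
#include <stdint.h>

/* Decide popcount(a) == k for a small constant k by clearing the lowest
   set bit k times; cheaper than a full count when k is tiny. */
static int popcount_eq_small(uint64_t a, unsigned k) {
    for (unsigned i = 0; i < k; ++i) {
        if (a == 0)
            return 0;          /* fewer than k bits were set */
        a &= a - 1;            /* clear the lowest set bit */
    }
    return a == 0;             /* exactly k iff nothing is left */
}
```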
Anyway, if you're interested in poking at it, here's the hacked bc file:
-------------- next part --------------
A non-text attachment was scrubbed...
Name: 186.crafty.llvm.bc
Type: application/octet-stream
Size: 401552 bytes
Desc: not available
URL: <http://lists.llvm.org/pipermail/llvm-commits/attachments/20110115/dc9747b7/attachment.obj>
-------------- next part --------------
-Chris