[PATCH] D44102: Teach CorrelatedValuePropagation to reduce the width of udiv/urem instructions.
Justin Lebar via Phabricator via llvm-commits
llvm-commits at lists.llvm.org
Mon Mar 5 12:21:30 PST 2018
jlebar added a comment.
Disappointingly, this doesn't work for simple cases where you mask the dividend:
%b = and i64 %a, 65535
%div = udiv i64 %b, 42
It does work for llvm.assume, which I guess is good enough for the specific case I have, but...maybe this is not the right pass to be doing this in? Or should I check known-bits here too? Sorry, I'm an ignoramus when it comes to the target-independent parts of LLVM.
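To make the question concrete, here's a sketch of what I mean by checking known-bits. The helper name and shape are hypothetical (nothing like this is in the patch), and it only handles the mask case -- the assume case would additionally need an AssumptionCache and a context instruction:

// Hypothetical sketch, not part of D44102: find the narrowest
// power-of-two width that provably holds both operands of a
// udiv/urem, using known bits instead of LazyValueInfo ranges.
#include "llvm/Analysis/ValueTracking.h"
#include "llvm/IR/DataLayout.h"
#include "llvm/IR/Instruction.h"
#include "llvm/Support/KnownBits.h"
#include "llvm/Support/MathExtras.h"
#include <algorithm>
using namespace llvm;

static unsigned maxActiveBits(Instruction &I, const DataLayout &DL) {
  unsigned MaxBits = 8; // no point narrowing below i8
  for (Value *Op : I.operands()) {
    KnownBits Known = computeKnownBits(Op, DL);
    // For %b = and i64 %a, 65535, the top 48 bits are known zero,
    // so only 16 bits are "active".
    MaxBits = std::max<unsigned>(
        MaxBits, Known.getBitWidth() - Known.countMinLeadingZeros());
  }
  return (unsigned)PowerOf2Ceil(MaxBits); // 16 for the masked example
}

In any case, here is the llvm.assume testcase that does get narrowed: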
target datalayout = "e-i64:64-v16:16-v32:32-n16:32:64"
target triple = "nvptx64-nvidia-cuda"

declare void @llvm.assume(i1)

define void @foo(i64 %a, i64* %ptr1, i64* %ptr2) {
  %cond = icmp ult i64 %a, 1024
  call void @llvm.assume(i1 %cond)
  %div = udiv i64 %a, 42
  %rem = urem i64 %a, 42
  store i64 %div, i64* %ptr1
  store i64 %rem, i64* %ptr2
  ret void
}
becomes, at `opt -O2`:
define void @foo(i64 %a, i64* nocapture %ptr1, i64* nocapture %ptr2) local_unnamed_addr #0 {
  %cond = icmp ult i64 %a, 1024
  tail call void @llvm.assume(i1 %cond)
  %div.lhs.trunc = trunc i64 %a to i16
  %div1 = udiv i16 %div.lhs.trunc, 42
  %div.zext = zext i16 %div1 to i64
  %1 = mul i16 %div1, 42
  %2 = sub i16 %div.lhs.trunc, %1
  %rem.zext = zext i16 %2 to i64
  store i64 %div.zext, i64* %ptr1, align 8
  store i64 %rem.zext, i64* %ptr2, align 8
  ret void
}
which lowers to the following PTX:
shr.u16 %rs2, %rs1, 1;
mul.wide.u16 %r1, %rs2, -15603;
shr.u32 %r2, %r1, 20;
cvt.u16.u32 %rs3, %r2;
cvt.u64.u32 %rd3, %r2;
mul.lo.s16 %rs4, %rs3, 42;
sub.s16 %rs5, %rs1, %rs4;
cvt.u64.u16 %rd4, %rs5;
st.u64 [%rd1], %rd3;
st.u64 [%rd2], %rd4;
This is even nicer than before: the magic-number division is now done as a 16-bit multiply widening to 32 bits, instead of a 32-bit multiply widening to 64 bits. At least, I hope that's efficient in NVPTX -- if not, that's our backend's problem. :)
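As a sanity check on those magic numbers (my own scribbling, not part of the patch): -15603 is the u16 bit pattern of 49933 = ceil(2^20 / 21), and since floor(x/42) == floor((x>>1)/21), the shr/mul.wide/shr sequence computes x/42 exactly for every 16-bit x. Brute force confirms it:

// Exhaustive check that the PTX sequence above computes x/42 and
// x%42 for all 16-bit inputs (my addition, not from the patch).
#include <cassert>
#include <cstdint>

int main() {
  for (uint32_t x = 0; x <= 0xFFFF; ++x) {
    // shr.u16 %rs2, %rs1, 1; mul.wide.u16 %r1, %rs2, -15603;
    // shr.u32 %r2, %r1, 20;
    uint32_t q = ((x >> 1) * 49933u) >> 20;
    assert(q == x / 42);
    // mul.lo.s16 %rs4, %rs3, 42; sub.s16 %rs5, %rs1, %rs4;
    uint16_t r = uint16_t(x) - uint16_t(q * 42);
    assert(r == x % 42);
  }
  return 0;
}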
https://reviews.llvm.org/D44102