Just for the record, here's what I was doing wrong:<br><br>!0 = metadata !{metadata !"output", null}<br>!1 = metadata !{metadata !"input1", null}<br>!2 = metadata !{metadata !"input2", null}<br>
<br>should be<br><br>!0 = metadata !{ }<br>!1 = metadata !{ metadata !"output", metadata !0 }<br>!2 = metadata !{ metadata !"input1", metadata !0 }<br>!3 = metadata !{ metadata !"input2", metadata !0 }<br>
<br>with the corresponding renaming of nodes.<br><br>With this metadata, opt -O3 successfully pulls the store out of the loop:<br><br>; ModuleID = 'check.ll'<br>target datalayout = "e-p:64:64:64-i1:8:8-i8:8:8-i16:16:16-i32:32:32-i64:64:64-f32:32:32-f64:64:64-v16:16:16-v32:32:32-v64:64:64-v128:128:128-n16:32:64"<br>
target triple = "nvptx64-unknown-unknown"<br><br>@__kernelgen_version = constant [15 x i8] c"0.2/1654:1675M\00"<br><br>define ptx_kernel void @__kernelgen_matvec_loop_7(i32* nocapture) nounwind alwaysinline {<br>
"Loop Function Root":<br> %tid.x = tail call ptx_device i32 @llvm.nvvm.read.ptx.sreg.tid.x()<br> %ctaid.x = tail call ptx_device i32 @llvm.nvvm.read.ptx.sreg.ctaid.x()<br> %PositionOfBlockInGrid.x = shl i32 %ctaid.x, 9<br>
%BlockLB.Add.ThreadPosInBlock.x = add i32 %PositionOfBlockInGrid.x, %tid.x<br> %isThreadLBgtLoopUB.x = icmp sgt i32 %BlockLB.Add.ThreadPosInBlock.x, 65535<br> br i1 %isThreadLBgtLoopUB.x, label %CUDA.AfterLoop.x, label %CUDA.LoopHeader.x.preheader<br>
<br>CUDA.LoopHeader.x.preheader: ; preds = %"Loop Function Root"<br> %1 = sext i32 %BlockLB.Add.ThreadPosInBlock.x to i64<br> store float 0.000000e+00, float* inttoptr (i64 47380979712 to float*), align 8192, !tbaa !0<br>
%p_.moved.to.4.cloned = shl nsw i64 %1, 9<br> br label %polly.loop_body<br><br>CUDA.AfterLoop.x.loopexit: ; preds = %polly.loop_body<br> store float %p_8, float* inttoptr (i64 47380979712 to float*), align 8192<br>
br label %CUDA.AfterLoop.x<br><br>CUDA.AfterLoop.x: ; preds = %CUDA.AfterLoop.x.loopexit, %"Loop Function Root"<br> ret void<br><br>polly.loop_body: ; preds = %polly.loop_body, %CUDA.LoopHeader.x.preheader<br>
%_p_scalar_ = phi float [ 0.000000e+00, %CUDA.LoopHeader.x.preheader ], [ %p_8, %polly.loop_body ]<br> %polly.loopiv10 = phi i64 [ 0, %CUDA.LoopHeader.x.preheader ], [ %polly.next_loopiv, %polly.loop_body ]<br> %polly.next_loopiv = add i64 %polly.loopiv10, 1<br>
%p_ = add i64 %polly.loopiv10, %p_.moved.to.4.cloned<br> %p_newGEPInst9.cloned = getelementptr float* inttoptr (i64 47246749696 to float*), i64 %p_<br> %p_newGEPInst12.cloned = getelementptr float* inttoptr (i64 47380971520 to float*), i64 %polly.loopiv10<br>
%_p_scalar_5 = load float* %p_newGEPInst9.cloned, align 4, !tbaa !2<br> %_p_scalar_6 = load float* %p_newGEPInst12.cloned, align 4, !tbaa !3<br> %p_7 = fmul float %_p_scalar_5, %_p_scalar_6<br> %p_8 = fadd float %_p_scalar_, %p_7<br>
%exitcond = icmp eq i64 %polly.next_loopiv, 512<br> br i1 %exitcond, label %CUDA.AfterLoop.x.loopexit, label %polly.loop_body<br>}<br><br>declare i32 @llvm.nvvm.read.ptx.sreg.tid.x() nounwind readnone<br><br>declare i32 @llvm.nvvm.read.ptx.sreg.ctaid.x() nounwind readnone<br>
<br>!0 = metadata !{metadata !"output", metadata !1}<br>!1 = metadata !{}<br>!2 = metadata !{metadata !"input1", metadata !1}<br>!3 = metadata !{metadata !"input2", metadata !1}<br><br><div class="gmail_quote">
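In general, the pattern seems to be: one root node with no operands, plus one scalar node per distinct memory "set", each naming the root as its second operand. A minimal sketch (node numbers and set names are arbitrary):<br>

```llvm
; TBAA root: a metadata node with no operands.
!0 = metadata !{ }
; Scalar nodes have the form { name, parent }. Two distinct names
; under the same root are answered as no-alias by -tbaa.
!1 = metadata !{ metadata !"set A", metadata !0 }
!2 = metadata !{ metadata !"set B", metadata !0 }
```

As far as I can tell, passing null as the parent instead makes each node the root of its own TBAA tree, and nodes from unrelated trees are conservatively treated as may-alias, which is why my original metadata had no effect.<br>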
2013/3/11 Dmitry Mikushin <span dir="ltr"><<a href="mailto:dmitry@kernelgen.org" target="_blank">dmitry@kernelgen.org</a>></span><br><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">
I tried to manually assign each of the 3 arrays a unique TBAA node, but it does not seem to help: alias analysis still considers the arrays as may-alias, which most likely prevents the desired optimization. Below is the sample code with the TBAA metadata inserted. Could you please suggest what might be wrong with it?<br>
<br>Many thanks,<br>- D.<br><br>marcusmae@M17xR4:~/forge/llvm$ opt -time-passes -enable-tbaa -tbaa -print-alias-sets -O3 check.ll -o - -S<br>Alias Set Tracker: 1 alias sets for 3 pointer values.<br> AliasSet[0x39046c0, 3] may alias, Mod/Ref Pointers: (float* inttoptr (i64 47380979712 to float*), 4), (float* %p_newGEPInst9.cloned, 4), (float* %p_newGEPInst12.cloned, 4)<br>
<br>; ModuleID = 'check.ll'<br>target datalayout = "e-p:64:64:64-i1:8:8-i8:8:8-i16:16:16-i32:32:32-i64:64:64-f32:32:32-f64:64:64-v16:16:16-v32:32:32-v64:64:64-v128:128:128-n16:32:64"<br>target triple = "nvptx64-unknown-unknown"<br>
<br>@__kernelgen_version = constant [15 x i8] c"0.2/1654:1675M\00"<br><br>define ptx_kernel void @__kernelgen_matvec_loop_7(i32* nocapture) #0 {<br>"Loop Function Root":<br> %tid.x = tail call ptx_device i32 @llvm.nvvm.read.ptx.sreg.tid.x()<br>
%ctaid.x = tail call ptx_device i32 @llvm.nvvm.read.ptx.sreg.ctaid.x()<br> %PositionOfBlockInGrid.x = shl i32 %ctaid.x, 9<br> %BlockLB.Add.ThreadPosInBlock.x = add i32 %PositionOfBlockInGrid.x, %tid.x<br> %isThreadLBgtLoopUB.x = icmp sgt i32 %BlockLB.Add.ThreadPosInBlock.x, 65535<br>
br i1 %isThreadLBgtLoopUB.x, label %CUDA.AfterLoop.x, label %CUDA.LoopHeader.x.preheader<br><br>CUDA.LoopHeader.x.preheader: ; preds = %"Loop Function Root"<br> %1 = sext i32 %BlockLB.Add.ThreadPosInBlock.x to i64<br>
store float 0.000000e+00, float* inttoptr (i64 47380979712 to float*), align 8192, !tbaa !0<br> %p_.moved.to.4.cloned = shl nsw i64 %1, 9<br> br label %polly.loop_body<br><br>CUDA.AfterLoop.x: ; preds = %polly.loop_body, %"Loop Function Root"<br>
ret void<br><br>polly.loop_body: ; preds = %polly.loop_body, %CUDA.LoopHeader.x.preheader<br> %_p_scalar_ = phi float [ 0.000000e+00, %CUDA.LoopHeader.x.preheader ], [ %p_8, %polly.loop_body ]<br>
%polly.loopiv10 = phi i64 [ 0, %CUDA.LoopHeader.x.preheader ], [ %polly.next_loopiv, %polly.loop_body ]<br> %polly.next_loopiv = add i64 %polly.loopiv10, 1<br> %p_ = add i64 %polly.loopiv10, %p_.moved.to.4.cloned<br>
%p_newGEPInst9.cloned = getelementptr float* inttoptr (i64 47246749696 to float*), i64 %p_<br>
%p_newGEPInst12.cloned = getelementptr float* inttoptr (i64 47380971520 to float*), i64 %polly.loopiv10<br> %_p_scalar_5 = load float* %p_newGEPInst9.cloned, align 4, !tbaa !1<br> %_p_scalar_6 = load float* %p_newGEPInst12.cloned, align 4, !tbaa !2<br>
%p_7 = fmul float %_p_scalar_5, %_p_scalar_6<br> %p_8 = fadd float %_p_scalar_, %p_7<br> store float %p_8, float* inttoptr (i64 47380979712 to float*), align 8192, !tbaa !0<br> %exitcond = icmp eq i64 %polly.next_loopiv, 512<br>
br i1 %exitcond, label %CUDA.AfterLoop.x, label %polly.loop_body<br>}<br><br>declare i32 @llvm.nvvm.read.ptx.sreg.tid.x() #1<br><br>declare i32 @llvm.nvvm.read.ptx.sreg.ctaid.x() #1<br><br>attributes #0 = { alwaysinline nounwind }<br>
attributes #1 = { nounwind readnone }<br><br>!0 = metadata !{metadata !"output", null}<br>!1 = metadata !{metadata !"input1", null}<br>!2 = metadata !{metadata !"input2", null}<br>===-------------------------------------------------------------------------===<br>
... Pass execution timing report ...<br>===-------------------------------------------------------------------------===<br> Total Execution Time: 0.0080 seconds (0.0082 wall clock)<br><br> ---User Time--- --User+System-- ---Wall Time--- --- Name ---<br>
0.0000 ( 0.0%) 0.0000 ( 0.0%) 0.0020 ( 24.5%) Print module to stderr<br> 0.0000 ( 0.0%) 0.0000 ( 0.0%) 0.0006 ( 7.9%) Induction Variable Simplification<br> 0.0000 ( 0.0%) 0.0000 ( 0.0%) 0.0006 ( 7.7%) Combine redundant instructions<br>
0.0000 ( 0.0%) 0.0000 ( 0.0%) 0.0004 ( 5.2%) Combine redundant instructions<br> 0.0000 ( 0.0%) 0.0000 ( 0.0%) 0.0004 ( 5.1%) Alias Set Printer<br> 0.0000 ( 0.0%) 0.0000 ( 0.0%) 0.0003 ( 3.8%) Combine redundant instructions<br>
0.0040 ( 50.0%) 0.0040 ( 50.0%) 0.0003 ( 3.8%) Combine redundant instructions<br> 0.0000 ( 0.0%) 0.0000 ( 0.0%) 0.0003 ( 3.8%) Global Value Numbering<br> 0.0040 ( 50.0%) 0.0040 ( 50.0%) 0.0003 ( 3.7%) Combine redundant instructions<br>
0.0000 ( 0.0%) 0.0000 ( 0.0%) 0.0002 ( 2.9%) Early CSE<br> 0.0000 ( 0.0%) 0.0000 ( 0.0%) 0.0002 ( 2.0%) Reassociate expressions<br> 0.0000 ( 0.0%) 0.0000 ( 0.0%) 0.0001 ( 1.7%) Early CSE<br>
0.0000 ( 0.0%) 0.0000 ( 0.0%) 0.0001 ( 1.6%) Natural Loop Information<br> 0.0000 ( 0.0%) 0.0000 ( 0.0%) 0.0001 ( 1.6%) Interprocedural Sparse Conditional Constant Propagation<br> 0.0000 ( 0.0%) 0.0000 ( 0.0%) 0.0001 ( 1.4%) Loop Invariant Code Motion<br>
0.0000 ( 0.0%) 0.0000 ( 0.0%) 0.0001 ( 1.4%) Module Verifier<br> 0.0000 ( 0.0%) 0.0000 ( 0.0%) 0.0001 ( 1.2%) Simplify the CFG<br> 0.0000 ( 0.0%) 0.0000 ( 0.0%) 0.0001 ( 1.1%) Value Propagation<br>
0.0000 ( 0.0%) 0.0000 ( 0.0%) 0.0001 ( 1.0%) Sparse Conditional Constant Propagation<br> 0.0000 ( 0.0%) 0.0000 ( 0.0%) 0.0001 ( 1.0%) Canonicalize natural loops<br> 0.0000 ( 0.0%) 0.0000 ( 0.0%) 0.0001 ( 1.0%) Dead Store Elimination<br>
0.0000 ( 0.0%) 0.0000 ( 0.0%) 0.0001 ( 0.9%) Module Verifier<br> 0.0000 ( 0.0%) 0.0000 ( 0.0%) 0.0001 ( 0.8%) Value Propagation<br> 0.0000 ( 0.0%) 0.0000 ( 0.0%) 0.0001 ( 0.8%) Simplify the CFG<br>
0.0000 ( 0.0%) 0.0000 ( 0.0%) 0.0001 ( 0.7%) Deduce function attributes<br> 0.0000 ( 0.0%) 0.0000 ( 0.0%) 0.0001 ( 0.7%) Remove unused exception handling info<br> 0.0000 ( 0.0%) 0.0000 ( 0.0%) 0.0001 ( 0.6%) Simplify the CFG<br>
0.0000 ( 0.0%) 0.0000 ( 0.0%) 0.0001 ( 0.6%) Jump Threading<br> 0.0000 ( 0.0%) 0.0000 ( 0.0%) 0.0000 ( 0.6%) Simplify the CFG<br> 0.0000 ( 0.0%) 0.0000 ( 0.0%) 0.0000 ( 0.6%) Simplify the CFG<br>
0.0000 ( 0.0%) 0.0000 ( 0.0%) 0.0000 ( 0.6%) Dominator Tree Construction<br> 0.0000 ( 0.0%) 0.0000 ( 0.0%) 0.0000 ( 0.6%) Function Integration/Inlining<br> 0.0000 ( 0.0%) 0.0000 ( 0.0%) 0.0000 ( 0.5%) Jump Threading<br>
0.0000 ( 0.0%) 0.0000 ( 0.0%) 0.0000 ( 0.5%) Canonicalize natural loops<br> 0.0000 ( 0.0%) 0.0000 ( 0.0%) 0.0000 ( 0.5%) Unswitch loops<br> 0.0000 ( 0.0%) 0.0000 ( 0.0%) 0.0000 ( 0.4%) MemCpy Optimization<br>
0.0000 ( 0.0%) 0.0000 ( 0.0%) 0.0000 ( 0.4%) Dominator Tree Construction<br> 0.0000 ( 0.0%) 0.0000 ( 0.0%) 0.0000 ( 0.4%) Loop-Closed SSA Form Pass<br> 0.0000 ( 0.0%) 0.0000 ( 0.0%) 0.0000 ( 0.3%) Recognize loop idioms<br>
0.0000 ( 0.0%) 0.0000 ( 0.0%) 0.0000 ( 0.3%) Dominator Tree Construction<br> 0.0000 ( 0.0%) 0.0000 ( 0.0%) 0.0000 ( 0.3%) Scalar Evolution Analysis<br> 0.0000 ( 0.0%) 0.0000 ( 0.0%) 0.0000 ( 0.3%) Dominator Tree Construction<br>
0.0000 ( 0.0%) 0.0000 ( 0.0%) 0.0000 ( 0.3%) Basic CallGraph Construction<br> 0.0000 ( 0.0%) 0.0000 ( 0.0%) 0.0000 ( 0.3%) Dominator Tree Construction<br> 0.0000 ( 0.0%) 0.0000 ( 0.0%) 0.0000 ( 0.3%) Dominator Tree Construction<br>
0.0000 ( 0.0%) 0.0000 ( 0.0%) 0.0000 ( 0.3%) Unroll loops<br> 0.0000 ( 0.0%) 0.0000 ( 0.0%) 0.0000 ( 0.3%) Aggressive Dead Code Elimination<br> 0.0000 ( 0.0%) 0.0000 ( 0.0%) 0.0000 ( 0.3%) Global Variable Optimizer<br>
0.0000 ( 0.0%) 0.0000 ( 0.0%) 0.0000 ( 0.3%) Loop-Closed SSA Form Pass<br> 0.0000 ( 0.0%) 0.0000 ( 0.0%) 0.0000 ( 0.2%) Loop-Closed SSA Form Pass<br> 0.0000 ( 0.0%) 0.0000 ( 0.0%) 0.0000 ( 0.2%) Inline Cost Analysis<br>
0.0000 ( 0.0%) 0.0000 ( 0.0%) 0.0000 ( 0.2%) Tail Call Elimination<br> 0.0000 ( 0.0%) 0.0000 ( 0.0%) 0.0000 ( 0.2%) Lazy Value Information Analysis<br> 0.0000 ( 0.0%) 0.0000 ( 0.0%) 0.0000 ( 0.2%) Lazy Value Information Analysis<br>
0.0000 ( 0.0%) 0.0000 ( 0.0%) 0.0000 ( 0.2%) Dead Argument Elimination<br> 0.0000 ( 0.0%) 0.0000 ( 0.0%) 0.0000 ( 0.2%) Dead Global Elimination<br> 0.0000 ( 0.0%) 0.0000 ( 0.0%) 0.0000 ( 0.1%) No target information<br>
0.0000 ( 0.0%) 0.0000 ( 0.0%) 0.0000 ( 0.1%) Target independent code generator's TTI<br> 0.0000 ( 0.0%) 0.0000 ( 0.0%) 0.0000 ( 0.1%) Merge Duplicate Global Constants<br> 0.0000 ( 0.0%) 0.0000 ( 0.0%) 0.0000 ( 0.1%) Simplify well-known library calls<br>
0.0000 ( 0.0%) 0.0000 ( 0.0%) 0.0000 ( 0.1%) Memory Dependence Analysis<br> 0.0000 ( 0.0%) 0.0000 ( 0.0%) 0.0000 ( 0.1%) Delete dead loops<br> 0.0000 ( 0.0%) 0.0000 ( 0.0%) 0.0000 ( 0.1%) SROA<br>
0.0000 ( 0.0%) 0.0000 ( 0.0%) 0.0000 ( 0.1%) Memory Dependence Analysis<br> 0.0000 ( 0.0%) 0.0000 ( 0.0%) 0.0000 ( 0.1%) Basic Alias Analysis (stateless AA impl)<br> 0.0000 ( 0.0%) 0.0000 ( 0.0%) 0.0000 ( 0.1%) SROA<br>
0.0000 ( 0.0%) 0.0000 ( 0.0%) 0.0000 ( 0.1%) Memory Dependence Analysis<br> 0.0000 ( 0.0%) 0.0000 ( 0.0%) 0.0000 ( 0.1%) Lower 'expect' Intrinsics<br> 0.0000 ( 0.0%) 0.0000 ( 0.0%) 0.0000 ( 0.1%) Rotate Loops<br>
0.0000 ( 0.0%) 0.0000 ( 0.0%) 0.0000 ( 0.1%) Promote 'by reference' arguments to scalars<br> 0.0000 ( 0.0%) 0.0000 ( 0.0%) 0.0000 ( 0.1%) Preliminary module verification<br> 0.0000 ( 0.0%) 0.0000 ( 0.0%) 0.0000 ( 0.1%) No Alias Analysis (always returns 'may' alias)<br>
0.0000 ( 0.0%) 0.0000 ( 0.0%) 0.0000 ( 0.0%) No target information<br> 0.0000 ( 0.0%) 0.0000 ( 0.0%) 0.0000 ( 0.0%) Target Library Information<br> 0.0000 ( 0.0%) 0.0000 ( 0.0%) 0.0000 ( 0.0%) Strip Unused Function Prototypes<br>
0.0000 ( 0.0%) 0.0000 ( 0.0%) 0.0000 ( 0.0%) No Alias Analysis (always returns 'may' alias)<br> 0.0000 ( 0.0%) 0.0000 ( 0.0%) 0.0000 ( 0.0%) Type-Based Alias Analysis<br> 0.0000 ( 0.0%) 0.0000 ( 0.0%) 0.0000 ( 0.0%) Preliminary module verification<br>
0.0080 (100.0%) 0.0080 (100.0%) 0.0082 (100.0%) Total<div class="HOEnZb"><div class="h5"><br><br><div class="gmail_quote">2013/3/11 Dmitry Mikushin <span dir="ltr"><<a href="mailto:dmitry@kernelgen.org" target="_blank">dmitry@kernelgen.org</a>></span><br>
<blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">Dear all,<br><br>Attached notunrolled.ll is a module containing a reduction kernel. What I'm trying to do is to unroll it in such a way that the partial reduction over the unrolled iterations is performed in a register, and then stored to memory only once. Currently LLVM's unroller together with all standard optimizations produces code that stores the value to memory after every unrolled iteration, which is much less efficient. Do you have an idea which combination of opt passes may help to cache the unrolled loop stores in a register?<br>
<br>Many thanks,<br>- D.<br>
</blockquote></div><br>
</div></div></blockquote></div><br>