<div dir="ltr">When you generate axpy-host.bc, you should use "clang -cc1 ..." with the "-fcuda-include-gpubinary" flag. "clang -cc1" invokes the frontend only. </div><div class="gmail_extra"><br><div class="gmail_quote">On Tue, Mar 15, 2016 at 6:45 PM, Yuanfeng Peng <span dir="ltr"><<a href="mailto:yuanfeng.jack.peng@gmail.com" target="_blank">yuanfeng.jack.peng@gmail.com</a>></span> wrote:<br><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"><div dir="ltr">Hi Jingyue,<div><br></div><div>Sorry to ask again, but how exactly could I glue the fatbin with the instrumented host code? Or does it mean we actually cannot instrument both the host & device code at the same time?</div><div><br></div><div>Thanks!</div><span class="HOEnZb"><font color="#888888"><div>yuanfeng</div></font></span></div><div class="HOEnZb"><div class="h5"><div class="gmail_extra"><br><div class="gmail_quote">On Tue, Mar 15, 2016 at 10:09 AM, Jingyue Wu <span dir="ltr"><<a href="mailto:jingyue@google.com" target="_blank">jingyue@google.com</a>></span> wrote:<br><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"><div dir="ltr"><span style="font-size:12.8px">Including fatbin into host code should be done in frontend. </span></div><div><div><div class="gmail_extra"><br><div class="gmail_quote">On Mon, Mar 14, 2016 at 12:13 AM, Yuanfeng Peng <span dir="ltr"><<a href="mailto:yuanfeng.jack.peng@gmail.com" target="_blank">yuanfeng.jack.peng@gmail.com</a>></span> wrote:<br><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"><div dir="ltr">Hey Jingyue,<div><br></div><div>Thanks for being so responsive! 
I finally figured out a way to resolve the issue: all I have to do is to use `-only-needed` when merging the device bitcodes with llvm-link.</div><div><br></div><div>However, since we actually need to instrument the host code as well, I encountered another issue when I tried to glue the instrumented host code and fatbin together. When I only instrumented the device code, I used the following cmd to do so:</div><div><br></div><div>
<p><span>"/mnt/wtf/tools/bin/clang-3.9" "-cc1" "-triple" "x86_64-unknown-linux-gnu" "-aux-triple" "nvptx64-nvidia-cuda" "-fcuda-target-overloads" "-fcuda-disable-target-call-checks" "-emit-obj" "-disable-free" "-main-file-name" "<a href="http://axpy.cu" target="_blank">axpy.cu</a>" "-mrelocation-model" "static" "-mthread-model" "posix" "-fmath-errno" "-masm-verbose" "-mconstructor-aliases" "-munwind-tables" "-fuse-init-array" "-target-cpu" "x86-64" "-momit-leaf-frame-pointer" "-dwarf-column-info" "-debugger-tuning=gdb" "-resource-dir" "/mnt/wtf/tools/bin/../lib/clang/3.9.0" "-I" "/usr/local/cuda-7.0/samples/common/inc" "-internal-isystem" "/usr/lib/gcc/x86_64-linux-gnu/4.8/../../../../include/c++/4.8" "-internal-isystem" "/usr/lib/gcc/x86_64-linux-gnu/4.8/../../../../include/x86_64-linux-gnu/c++/4.8" "-internal-isystem" "/usr/lib/gcc/x86_64-linux-gnu/4.8/../../../../include/x86_64-linux-gnu/c++/4.8" "-internal-isystem" "/usr/lib/gcc/x86_64-linux-gnu/4.8/../../../../include/c++/4.8/backward" "-internal-isystem" "/usr/local/include" "-internal-isystem" "/mnt/wtf/tools/bin/../lib/clang/3.9.0/include" "-internal-externc-isystem" "/usr/include/x86_64-linux-gnu" "-internal-externc-isystem" "/include" "-internal-externc-isystem" "/usr/include" "-internal-isystem" "/usr/local/cuda/include" "-include" "__clang_cuda_runtime_wrapper.h" "-O3" "-fdeprecated-macro" "-fdebug-compilation-dir" "/mnt/wtf/workspace/cuda/gpu-race-detection" "-ferror-limit" "19" "-fmessage-length" "291" "-pthread" "-fobjc-runtime=gcc" "-fcxx-exceptions" "-fexceptions" "-fdiagnostics-show-option" "-vectorize-loops" "-vectorize-slp" "-o" "axpy-host.o" "-x" "cuda" "tests/<a href="http://axpy.cu" target="_blank">axpy.cu</a>" "-fcuda-include-gpubinary" "axpy-sm_30.fatbin"</span></p>
<p><span></span>which, from my understanding, compiles the host code in tests/<a href="http://axpy.cu" target="_blank">axpy.cu</a> and links it with axpy-sm_30.fatbin. However, now that I have instrumented the IR of the host code (axpy.bc) and run `llc axpy.bc -o axpy.s`, which cmd should I use to link axpy.s with axpy-sm_30.fatbin? I tried to use -cc1as, but the flag '-fcuda-include-gpubinary' was not recognized.</p><p>Thanks!</p><span><font color="#888888"><p>yuanfeng</p></font></span></div></div><div><div><div class="gmail_extra"><br><div class="gmail_quote">On Sat, Mar 12, 2016 at 12:05 AM, Jingyue Wu <span dir="ltr"><<a href="mailto:jingyue@google.com" target="_blank">jingyue@google.com</a>></span> wrote:<br><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"><div dir="ltr">I've no idea. Without instrumentation, nvvm_reflect_anchor doesn't appear in the final PTX, right? If that's the case, some pass in llc must have deleted the anchor and you should be able to figure out which one. </div><div><div><div class="gmail_extra"><br><div class="gmail_quote">On Fri, Mar 11, 2016 at 4:56 PM, Yuanfeng Peng <span dir="ltr"><<a href="mailto:yuanfeng.jack.peng@gmail.com" target="_blank">yuanfeng.jack.peng@gmail.com</a>></span> wrote:<br><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"><div dir="ltr">Hey Jingyue,<div><br></div><div>Though I tried `opt -nvvm-reflect` on both bc files, the nvvm reflect anchor didn't go away; ptxas is still complaining about the duplicate definition of function '_ZL21__nvvm_reflect_anchorv'. Did I misuse the nvvm-reflect pass?</div><div><br></div><div>Thanks!</div><span><font color="#888888"><div>yuanfeng</div>
</font></span></div><div><div><div class="gmail_extra"><br><div class="gmail_quote">On Fri, Mar 11, 2016 at 10:10 AM, Jingyue Wu <span dir="ltr"><<a href="mailto:jingyue@google.com" target="_blank">jingyue@google.com</a>></span> wrote:<br><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"><div dir="ltr">According to the examples you sent, I believe the linking issue was caused by nvvm reflection anchors. I haven't played with that, but I guess running nvvm-reflect on an IR removes the nvvm reflect anchors. After that, you can llvm-link the two bc/ll files. <div><br></div><div>Another potential issue is that your cuda_hooks-sm_30.ll is unoptimized. This could cause the instrumented code to run super slow. </div></div><div><div><div class="gmail_extra"><br><div class="gmail_quote">On Fri, Mar 11, 2016 at 9:40 AM, Yuanfeng Peng <span dir="ltr"><<a href="mailto:yuanfeng.jack.peng@gmail.com" target="_blank">yuanfeng.jack.peng@gmail.com</a>></span> wrote:<br><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"><div dir="ltr">Hey Jingyue,<div><br></div><div>Attached are the .ll files. Thanks!</div><span><font color="#888888"><div><br></div><div>yuanfeng</div></font></span></div><div><div><div class="gmail_extra"><br><div class="gmail_quote">On Fri, Mar 11, 2016 at 3:47 AM, Jingyue Wu <span dir="ltr"><<a href="mailto:jingyue@google.com" target="_blank">jingyue@google.com</a>></span> wrote:<br><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"><div dir="ltr">Looks like we are getting closer! 
<br><div class="gmail_extra"><br><div class="gmail_quote"><span>On Thu, Mar 10, 2016 at 5:21 PM, Yuanfeng Peng <span dir="ltr"><<a href="mailto:yuanfeng.jack.peng@gmail.com" target="_blank">yuanfeng.jack.peng@gmail.com</a>></span> wrote:<br><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"><div dir="ltr"><div class="gmail_extra">Hi Jingyue,</div><div class="gmail_extra"><br></div><div class="gmail_extra">Thank you so much for the helpful response! I didn't know that PTX assembly cannot be linked; that's likely the reason for my issue.</div><div class="gmail_extra"><br></div><div class="gmail_extra">So I did the following as you suggested(axpy-sm_30.bc is the instrumented bitcode, and cuda_hooks-sm_30.bc contains the hook functions):</div><div class="gmail_extra"><br></div><div class="gmail_extra">
<p><i><b><span>llvm-link</span><span> axpy-sm_30.bc cuda_hooks-sm_30.bc -o inst_axpy-sm_30.bc</span></b></i></p><p><span><i><b>llc inst_axpy-sm_30.bc -o axpy-sm_30.s</b></i></span></p><p><span><i><b>
</b></i></span></p><p><span><i><b>"/usr/local/cuda/bin/ptxas" "-m64" "-O3" -c "--gpu-name" "sm_30" "--output-file" axpy-sm_30.o axpy-sm_30.s</b></i></span></p><p>However, I got the following error from ptxas:</p><p><span><b>ptxas axpy-sm_30.s, line 106; error : Duplicate definition of function '_ZL21__nvvm_reflect_anchorv'</b></span></p><p><span><b>ptxas axpy-sm_30.s, line 106; fatal : Parsing error near '.2': syntax error</b></span></p><p>
</p><p><span><b>ptxas fatal : Ptx assembly aborted due to errors</b></span></p><p>Looks like some cuda function definitions are in both bitcode files which caused duplicate definition... what am I supposed to do to resolve this issue?</p></div></div></blockquote></span><div>Can you attach axpy-sm_30.ll and cuda_hooks-sm_30.ll? The duplication may be caused by how nvvm reflection works, but I'd like to see a concrete example. <br></div><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"><div dir="ltr"><div class="gmail_extra"><p><br></p><p>Thanks!</p><span><font color="#888888"><p>yuanfeng </p><p><span><b><br></b></span></p></font></span></div><div class="gmail_extra"><br><br></div></div>
</blockquote></div><br></div></div>
</blockquote></div><br></div>
</div></div></blockquote></div><br></div>
</div></div></blockquote></div><br></div>
</div></div></blockquote></div><br></div>
</div></div></blockquote></div><br></div>
</div></div></blockquote></div><br></div>
</div></div></blockquote></div><br></div>
</div></div></blockquote></div><br></div>