Dear all,<br><br>FWIW, I've tested libdevice.compute_20.10.bc and libdevice.compute_30.10.bc from /cuda/nvvm/libdevice, shipped with the CUDA 5.5 preview. The IR is compatible with the LLVM 3.4 trunk that we use. Results are correct, and performance is almost the same as what we had before with the cicc-sniffed IR, or maybe less than 10% better. I will test libdevice.compute_35.10.bc once we get K20 support.<br>
<br>Thanks for addressing this,<br>- D.<br><br><div class="gmail_quote">2013/2/17 Dmitry Mikushin <span dir="ltr"><<a href="mailto:dmitry@kernelgen.org" target="_blank">dmitry@kernelgen.org</a>></span><br><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">
<div class="im">> The issue is really that there is no standard math library for PTX.<br><br></div>Well, formally, that could very well be true. Moreover, some parts of the CPU math standard are impossible to implement on parallel architectures; consider, for example, errno behavior. But here we are speaking more about the practical side. And the practical side is: for the past 5 years CUDA has claimed to accelerate compute applications, which implies having good math support. For clarity, we can drop the term "LLVM CUDA math library" and instead speak of the need for all of LLVM to have "the same degree of math support" that CUDA currently has for C/C++.<br>
<br>If you think having the math module outside of the backend is more feasible, that is also a way to go, but please see what we need in that case in the first email: the NVPTX backend will still have to tell us which intrinsics it is going to lower and which ones will make it crash. So something in the backend needs to be modified either way.<div class="HOEnZb">
<div class="h5"><br>
<br>- D.<br><br><div class="gmail_quote">2013/2/17 Justin Holewinski <span dir="ltr"><<a href="mailto:justin.holewinski@gmail.com" target="_blank">justin.holewinski@gmail.com</a>></span><br><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">
<div dir="ltr">The X86 back-end just calls into libm:<div><br></div><div><pre style="white-space:pre-wrap;word-wrap:break-word"> // Always use a library call for pow.
setOperationAction(ISD::FPOW , MVT::f32 , Expand);
setOperationAction(ISD::FPOW , MVT::f64 , Expand);
setOperationAction(ISD::FPOW , MVT::f80 , Expand);</pre><pre style="white-space:pre-wrap;word-wrap:break-word"><br></pre>The issue is really that there is no standard math library for PTX. I agree that this is a pain for most users, but I don't think the right solution is to embed a whole suite of math functions into the back-end. All I'm suggesting is that we instead follow the path of linking in an external math library of target-specific functions. Whether you link your IR with a bitcode library before codegen or have codegen emit library function calls is an implementation detail, with each having advantages. The accuracy modes can be used to pick the proper library function in the latter case, but I still think library function choice is better left up to the front-end, and the accuracy attributes are a better fit to drive optimization.</div>
</div><div class="gmail_extra"><div><div><br><br><div class="gmail_quote">On Sun, Feb 17, 2013 at 9:48 AM, Dmitry Mikushin <span dir="ltr"><<a href="mailto:dmitry@kernelgen.org" target="_blank">dmitry@kernelgen.org</a>></span> wrote:<br>
<blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">Hi Justin,<br><br>I don't understand why, for instance, the X86 backend handles pow automatically, while NVPTX should be a PITA requiring the user to bring their own pow implementation. Even at a very general level, this limits users' interest in the LLVM NVPTX backend. Could you please elaborate on the rationale behind your point? Why are the accuracy modes I suggested not sufficient, in your opinion?<span><font color="#888888"><br>
<br>- D.</font></span><div><div><br><br><div class="gmail_quote">2013/2/17 Justin Holewinski <span dir="ltr"><<a href="mailto:justin.holewinski@gmail.com" target="_blank">justin.holewinski@gmail.com</a>></span><br>
<blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">
<p dir="ltr">I would be very hesitant to expose all math library functions as intrinsics. I believe linking with a target-specific math library is the correct approach, as it decouples the back end from the needs of the source program/language. Users should be free to use any math library implementation they choose. Intrinsics are meant for functions that compile down to specific ISA features, like fused multiply-add and square root.</p>
<div><div>
<div class="gmail_quote">On Feb 16, 2013 8:46 PM, "Dmitry Mikushin" <<a href="mailto:dmitry@kernelgen.org" target="_blank">dmitry@kernelgen.org</a>> wrote:<br type="attribution"><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">
Dear Yuan,<br><br>Sorry for the delayed reply.<br><br>The answers to your questions depend on where the math library sits in the code generation pipeline. At KernelGen, we currently have a user-level CUDA math module, adapted from cicc internals [1]. It is intended to be linked with the user's LLVM IR module right before the final optimization and the backend. For the last few months we have been using this method as a temporary workaround for the absence of many math functions, to keep up the pace of application testing in our compiler test suite. Supplying math this way is not portable and introduces many issues, for instance:<br>
1) The frontend (DragonEgg, in our case) must be taught to emit real math function calls instead of the LLVM intrinsics that NVPTX cannot handle.<br>2) However, not all intrinsics can be replaced by math calls directly; for example, there is no cdexp call, but it could be modelled with sincos.<br>
3) Our math module assumes sm_20, and could be inefficient or non-portable on other families of GPUs.<br><br>Instead of this approach, I think the math library should be implemented <u>as a lowering pass in the backend</u>, working directly with intrinsics. In that case naming is not important, and final optimization is the job of the backend. But there is another important thing: the backend should codegen math with respect to accuracy settings, specified either as backend options or as function attributes (a quite recent addition to LLVM). The accuracy settings should be:<br>
1) fast-math (ftz, prec-div, prec-sqrt, fma, etc.)<br>2) Whether or not to use GPU-specific low-precision functions (__sin, __cos, etc.)<br><br>Following the latter approach, math handling in NVPTX will conform to the rest of LLVM, and no host-dependent tweaks will be needed.<br>
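<br>As a rough illustration of how the two accuracy knobs above could steer the choice of callee, here is a minimal standalone C++ sketch. It is not NVPTX code: AccuracySettings and selectMathCallee are hypothetical names, and the only outside fact it relies on is the CUDA convention that fast single-precision variants carry a double-underscore prefix (__sinf, __cosf, and so on).<br>

```cpp
#include <cassert>
#include <string>

// Hypothetical accuracy settings mirroring the two knobs listed above.
struct AccuracySettings {
  // ftz, approximate div/sqrt, fma contraction. Not consulted below: it would
  // steer instruction-level choices rather than callee selection.
  bool fastMath = false;
  // Allow GPU-specific low-precision variants of math functions.
  bool useLowPrecision = false;
};

// Returns the callee name for a single-precision math function. The list of
// functions with a fast variant is illustrative, not exhaustive.
std::string selectMathCallee(const std::string &name,
                             const AccuracySettings &settings) {
  assert(!name.empty());
  const bool hasFastVariant = name == "sinf" || name == "cosf" ||
                              name == "expf" || name == "logf" ||
                              name == "powf";
  if (settings.useLowPrecision && hasFastVariant)
    return "__" + name;  // CUDA-style fast intrinsic name
  return name;           // fall back to the precise library function
}
```

With useLowPrecision set, a request for sinf resolves to __sinf, while a function without a fast variant (erff, say) keeps its precise name in either mode.<br>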
<br>I'm also interested in contributing to these developments at a reasonable depth. Moving this part forward entirely on our own would slow down progress on our main targets too much; that's why I'm asking for your help and cooperation.<br>
<br>Best regards,<br>- Dima.<br><br>[1] <a href="https://hpcforge.org/scm/viewvc.php/*checkout*/trunk/src/cuda/include/math.bc?root=kernelgen" target="_blank">https://hpcforge.org/scm/viewvc.php/*checkout*/trunk/src/cuda/include/math.bc?root=kernelgen</a><br>
<br><div class="gmail_quote">2013/2/8 Yuan Lin <span dir="ltr"><<a href="mailto:yulin@nvidia.com" target="_blank">yulin@nvidia.com</a>></span><br><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">
<div link="blue" vlink="purple" lang="EN-US"><div><p class="MsoNormal"><span style="font-size:11.0pt;font-family:"Calibri","sans-serif";color:#1f497d">Yes, it helps a lot and we are working on it.<u></u><u></u></span></p>
<p class="MsoNormal"><span style="font-size:11.0pt;font-family:"Calibri","sans-serif";color:#1f497d"><u></u> <u></u></span></p><p class="MsoNormal"><span style="font-size:11.0pt;font-family:"Calibri","sans-serif";color:#1f497d">A few questions,<u></u><u></u></span></p>
<p><u></u><span style="font-size:11.0pt;font-family:"Calibri","sans-serif";color:#1f497d"><span>1)<span style="font:7.0pt "Times New Roman""> </span></span></span><u></u><span style="font-size:11.0pt;font-family:"Calibri","sans-serif";color:#1f497d">What will be your use model of this library? Will you run optimization phases after linking with the library? If so, what are they?<u></u><u></u></span></p>
<p><u></u><span style="font-size:11.0pt;font-family:"Calibri","sans-serif";color:#1f497d"><span>2)<span style="font:7.0pt "Times New Roman""> </span></span></span><u></u><span style="font-size:11.0pt;font-family:"Calibri","sans-serif";color:#1f497d">Do you care if the names of functions differ from those in libm? For example, it would be gpusin() instead of sin(). <u></u><u></u></span></p>
<p><u></u><span style="font-size:11.0pt;font-family:"Calibri","sans-serif";color:#1f497d"><span>3)<span style="font:7.0pt "Times New Roman""> </span></span></span><u></u><span style="font-size:11.0pt;font-family:"Calibri","sans-serif";color:#1f497d">Do you need a different library for different host platforms? Why?<u></u><u></u></span></p>
<p><u></u><span style="font-size:11.0pt;font-family:"Calibri","sans-serif";color:#1f497d"><span>4)<span style="font:7.0pt "Times New Roman""> </span></span></span><u></u><span style="font-size:11.0pt;font-family:"Calibri","sans-serif";color:#1f497d">Any other functions (besides math) you want to see in this library?<u></u><u></u></span></p>
<p class="MsoNormal"><span style="font-size:11.0pt;font-family:"Calibri","sans-serif";color:#1f497d"><u></u> <u></u></span></p><p class="MsoNormal"><span style="font-size:11.0pt;font-family:"Calibri","sans-serif";color:#1f497d">Thanks.<u></u><u></u></span></p>
<p class="MsoNormal"><span style="font-size:11.0pt;font-family:"Calibri","sans-serif";color:#1f497d"><u></u> <u></u></span></p><p class="MsoNormal"><span style="font-size:11.0pt;font-family:"Calibri","sans-serif";color:#1f497d">Yuan<u></u><u></u></span></p>
<p class="MsoNormal"><span style="font-size:11.0pt;font-family:"Calibri","sans-serif";color:#1f497d"><u></u> <u></u></span></p><p class="MsoNormal"><span style="font-size:11.0pt;font-family:"Calibri","sans-serif";color:#1f497d"><u></u> <u></u></span></p>
<p class="MsoNormal"><b><span style="font-size:10.0pt;font-family:"Tahoma","sans-serif"">From:</span></b><span style="font-size:10.0pt;font-family:"Tahoma","sans-serif""> Dmitry Mikushin [mailto:<a href="mailto:dmitry@kernelgen.org" target="_blank">dmitry@kernelgen.org</a>] <br>
<b>Sent:</b> Thursday, February 07, 2013 2:09 PM<br><b>To:</b> Justin Holewinski; LLVM Developers Mailing List<br><b>Cc:</b> Yuan Lin<br><b>Subject:</b> [NVPTX] We need an LLVM CUDA math library, after all<u></u><u></u></span></p>
<div><div><p class="MsoNormal"><u></u> <u></u></p><p class="MsoNormal">Hi Justin, gentlemen,<br><br>I'm afraid I have to escalate this issue at this point. Since it was first discussed last summer, it has been sufficient for us for a while to have the lowering of math calls into intrinsics disabled at the DragonEgg level, and to link them against CUDA math functions at the LLVM IR level. Now I can say: this is no longer sufficient, and we need the NVPTX backend to deal with GPU math.<br>
<br>> There also is no standard libm for PTX.<br><br>Yes, that's right, but there is an interesting idea: codegen the CUDA math headers into LLVM IR and link the result with the user module at the IR level. This method gives a great deal of flexibility with respect to high-level languages: the user no longer needs to deal with headers and can have math right in the IR, regardless of the language it was lowered from. I can confirm this method works very well for us with C and Fortran, but in order to make accurate replacements of unsupported intrinsic calls, it needs to be aware of the NVPTX backend's capabilities, in the form of:<br>
<br>bool NVPTXTargetMachine::<u></u><u></u></p><div><p class="MsoNormal">isIntrinsicSupported(Function& intrinsic) and<br>string NVPTXTargetMachine::whichMathCallReplacesIntrinsic(Function& intrinsic)<br><br>> I would prefer not to lower such things in the back-end since different compilers may want to implement such functions differently based on speed vs. accuracy trade-offs.<br>
<br>Who are those different compilers? We are LLVM, the complete compiler stack, which should handle these things according to its own preference. Derived compilers may certainly think differently, and it's their own business to change anything they want and never contribute back. We should not forget that there are a lot of derived projects that use LLVM directly, like KernelGen or the many embedded DSLs that have recently started flourishing. Their completeness and future rely on LLVM. For these reasons, I would strongly prefer that LLVM/NVPTX supply a reference GPU math implementation, and I invite you and everyone else to form a joint roadmap to deliver it.<br>
<br>Before we start: IANAL, but something tells me there could be a licensing issue with releasing the LLVM IR emitted from CUDA headers.<br>Could you please check this with NVIDIA?<br><br>Many thanks,<br>- D.<br><br>
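<br>To make the two queries proposed in this message more concrete, here is a minimal standalone sketch of what they might answer. Everything in it is an assumption for illustration: the class name MathLoweringInfo is hypothetical, plain strings stand in for llvm::Function, and the table contents are examples rather than a statement of what NVPTX actually selects. The __powisf2 entry is taken from the error message quoted further down in the thread.<br>

```cpp
#include <cassert>
#include <map>
#include <set>
#include <string>

// Hypothetical stand-in for the proposed backend queries. A real version
// would live in the NVPTX target and take llvm::Function arguments.
class MathLoweringInfo {
public:
  // True if the backend can select the intrinsic without a library call.
  bool isIntrinsicSupported(const std::string &intrinsic) const {
    assert(!intrinsic.empty());
    return native_.count(intrinsic) != 0;
  }

  // Name of the library function that replaces an unsupported intrinsic,
  // or an empty string if no replacement is known.
  std::string whichMathCallReplacesIntrinsic(const std::string &intrinsic) const {
    assert(!intrinsic.empty());
    const auto it = replacements_.find(intrinsic);
    return it == replacements_.end() ? std::string() : it->second;
  }

private:
  // Illustrative tables only.
  std::set<std::string> native_ = {"llvm.sqrt.f32", "llvm.fma.f32"};
  std::map<std::string, std::string> replacements_ = {
      {"llvm.pow.f32", "powf"},
      {"llvm.sin.f32", "sinf"},
      {"llvm.cos.f32", "cosf"},
      {"llvm.powi.f32", "__powisf2"},
  };
};
```

An IR-level math linker could walk the intrinsic calls in a module, keep those for which isIntrinsicSupported returns true, and rewrite the rest to the function named by whichMathCallReplacesIntrinsic before linking in the math bitcode.<br>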
2012/9/6 Justin Holewinski <<a href="mailto:justin.holewinski@gmail.com" target="_blank">justin.holewinski@gmail.com</a>>:<br>
> On 09/06/2012 10:02 AM, Dmitry N. Mikushin wrote:<br>>><br>>> Dear all,<br>>><br>>> During app compilation we have a crash in NVPTX backend:<br>>><br>>> LLVM ERROR: Cannot select: 0x732b270: i64 = ExternalSymbol'__powisf2'<br>
>> [ID=18]<br>>><br>>> As I understand LLVM tries to lower the following call<br>>><br>>> %28 = call ptx_device float @llvm.powi.f32(float 2.000000e+00, i32 %8)<br>>> nounwind readonly<br>
>><br>>> to device intrinsic. The table llvm/IntrinsicsNVVM.td does not contain<br>>> such intrinsic, however it should be builtin, according to<br>>> cuda/include/math_functions.h<br>><br>><br>
> It actually gets lowered into an external function call.<br>><br>><br>>><br>>> Is my understanding correct, and we need simply add the corresponding<br>>> definition to llvm/IntrinsicsNVVM.td ? How to do that, what are the<br>
>> rules?<br>><br>><br>> PTX does not have an instruction (or simple series of instructions) that<br>> implements pow, so this will not be handled. I would prefer not to lower<br>> such things in the back-end since different compilers may want to implement<br>
> such functions differently based on speed vs. accuracy trade-offs.<br>><br>> There also is no standard libm for PTX. It is up to the higher-level<br>> compiler to link against a run-time library that provides functions like pow<br>
> (see include/math_functions.h in a CUDA distribution).<br>><br>>><br>>> Thanks,<br>>> - D.<br>>> _______________________________________________<br>>> LLVM Developers mailing list<br>
>> <a href="mailto:LLVMdev@cs.uiuc.edu" target="_blank">LLVMdev@cs.uiuc.edu</a> <a href="http://llvm.cs.uiuc.edu" target="_blank">http://llvm.cs.uiuc.edu</a><br>>> <a href="http://lists.cs.uiuc.edu/mailman/listinfo/llvmdev" target="_blank">http://lists.cs.uiuc.edu/mailman/listinfo/llvmdev</a><br>
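<br>For reference, the __powisf2 symbol in the error quoted above is the runtime helper that a call to llvm.powi.f32 turns into when no instruction implements it; on CPU targets it is normally provided by compiler-rt or libgcc. A device-side math library would need to supply an equivalent. The sketch below shows the usual repeated-squaring scheme in plain C++; the name powi_f32 is hypothetical and the code is not taken from CUDA or libdevice.<br>

```cpp
#include <cassert>

// Hypothetical replacement for the missing __powisf2 helper: computes a^b
// for an integer exponent by repeated squaring.
float powi_f32(float a, int b) {
  const bool reciprocal = b < 0;
  float result = 1.0f;
  while (true) {
    if (b & 1)       // current low bit set: fold this power of a in
      result *= a;
    b /= 2;          // truncates toward zero, so negative b also reaches 0
    if (b == 0)
      break;
    a *= a;          // next power: a, a^2, a^4, ...
  }
  return reciprocal ? 1.0f / result : result;
}

// Small self-check; the comparisons are exact because powers of two are
// representable in float.
void checkPowi() {
  assert(powi_f32(2.0f, 10) == 1024.0f);
  assert(powi_f32(2.0f, -2) == 0.25f);
  assert(powi_f32(3.0f, 0) == 1.0f);
}
```

The loop runs once per bit of the exponent, so the cost is logarithmic in the magnitude of b rather than linear.<br>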
><u></u><u></u></p><p class="MsoNormal"><span><span style="color:#888888">></span></span><span style="color:#888888"><br><span>> --</span><br>
<span>> Thanks,</span><br><span>></span><br><span>> Justin Holewinski</span><br><span>></span></span><u></u><u></u></p></div></div></div></div>
<p></p>
</div>
</blockquote></div><br>
</blockquote></div>
</div></div></blockquote></div><br>
</div></div></blockquote></div><br><br clear="all"><div><br></div></div></div><span><font color="#888888">-- <br><br><div>Thanks,</div><div><br></div><div>Justin Holewinski</div>
</font></span></div>
</blockquote></div><br>
</div></div></blockquote></div><br>