[LLVMdev] [NVPTX] We need an LLVM CUDA math library, after all

Dmitry Mikushin dmitry at kernelgen.org
Wed Jun 5 12:06:30 PDT 2013


> Also, I've been meaning to address your -drvcuda issue.  How would you
> feel about making that a part of the triple?

Thanks; whatever you prefer. Currently I'm using a target attribute for
that, as in the patch I showed you.

- D.

2013/6/5 Justin Holewinski <justin.holewinski at gmail.com>

> Thanks for the info!  I would be glad to hear of any issues you have
> encountered on this path.  I tried to make sure the 3.3 release was fully
> compatible with the libdevice implementation shipping with 5.5 (and as far
> as I know, it is).  It's just not an officially supported configuration.
>
> Also, I've been meaning to address your -drvcuda issue.  How would you
> feel about making that a part of the triple?
>
>
> On Wed, Jun 5, 2013 at 5:10 AM, Dmitry Mikushin <dmitry at kernelgen.org> wrote:
>
>> Dear all,
>>
>> FWIW, I've tested libdevice.compute_20.10.bc and
>> libdevice.compute_30.10.bc from /cuda/nvvm/libdevice shipped with the CUDA
>> 5.5 preview. The IR is compatible with the LLVM 3.4 trunk that we use.
>> Results are correct, and performance is almost the same as what we had
>> before with cicc-sniffed IR, or maybe <10% better. We will test
>> libdevice.compute_35.10.bc once we get K20 support.
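>>
>> In case it's useful to others: pulling libdevice in is plain IR-level
>> linking, nothing NVPTX-specific. A minimal sketch, assuming the 3.4-era
>> C++ API (file names are placeholders for whatever the driver passes in):
>>
>>   #include "llvm/IR/LLVMContext.h"
>>   #include "llvm/IRReader/IRReader.h"
>>   #include "llvm/Linker.h"
>>   #include "llvm/Support/ErrorHandling.h"
>>   #include "llvm/Support/SourceMgr.h"
>>   using namespace llvm;
>>
>>   int main() {
>>     LLVMContext &Ctx = getGlobalContext();
>>     SMDiagnostic Err;
>>     // Load the user module and libdevice into the same context.
>>     Module *M = ParseIRFile("user.bc", Err, Ctx);
>>     Module *Lib = ParseIRFile("libdevice.compute_20.10.bc", Err, Ctx);
>>     if (!M || !Lib)
>>       report_fatal_error("failed to load input modules");
>>     // Merge libdevice into the user module before optimization/codegen.
>>     std::string ErrMsg;
>>     if (Linker::LinkModules(M, Lib, Linker::DestroySource, &ErrMsg))
>>       report_fatal_error("libdevice link failed: " + ErrMsg);
>>     return 0;
>>   }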
>>
>> Thanks for addressing this,
>> - D.
>>
>>
>> 2013/2/17 Dmitry Mikushin <dmitry at kernelgen.org>
>>
>>> > The issue is really that there is no standard math library for PTX.
>>>
>>> Well, formally, that could very well be true. Moreover, parts of the CPU
>>> math standard are impossible to honor on parallel architectures;
>>> consider, for example, errno behavior. But here we are speaking more
>>> about the practical side. And the practical side is: for the past 5
>>> years CUDA has claimed to accelerate compute applications, and that
>>> implies good math support. For clarity, we can drop the term "LLVM CUDA
>>> math library" and instead speak of the need for all of LLVM to have the
>>> same degree of math support that CUDA currently has for C/C++.
>>>
>>> If you think keeping the math module outside of the backend is more
>>> feasible, that is also a way to go, but please see what we need in that
>>> case in the first email: either way, the NVPTX backend will have to tell
>>> us which intrinsics it is going to lower and which ones will make it
>>> crash. So something in the backend needs to be modified in any case.
>>>
>>>
>>> - D.
>>>
>>> 2013/2/17 Justin Holewinski <justin.holewinski at gmail.com>
>>>
>>>> The X86 back-end just calls into libm:
>>>>
>>>>   // Always use a library call for pow.
>>>>   setOperationAction(ISD::FPOW             , MVT::f32  , Expand);
>>>>   setOperationAction(ISD::FPOW             , MVT::f64  , Expand);
>>>>   setOperationAction(ISD::FPOW             , MVT::f80  , Expand);
>>>>
>>>>
>>>> The issue is really that there is no standard math library for PTX.  I
>>>> agree that this is a pain for most users, but I don't think the right
>>>> solution is to embed a whole suite of math functions into the back-end.
>>>>  All I'm suggesting is that we instead follow the path of linking in an
>>>> external math library of target-specific functions.  Whether you link your
>>>> IR with a bitcode library before codegen or have codegen emit library
>>>> function calls is an implementation detail, with each having advantages.
>>>>  The accuracy modes can be used to pick the proper library function in the
>>>> latter case, but I still think library function choice is better left up to
>>>> the front-end, and the accuracy attributes are a better fit to drive
>>>> optimization.
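>>>>
>>>> For concreteness, here is roughly what the libcall route would look
>>>> like in the back-end -- just a sketch, not something NVPTX does today,
>>>> and the __nv_powf/__nv_pow names assume the libdevice entry points
>>>> shipped with CUDA 5.5:
>>>>
>>>>   // In the NVPTX TargetLowering constructor (hypothetical):
>>>>   setOperationAction(ISD::FPOW, MVT::f32, Expand);
>>>>   setOperationAction(ISD::FPOW, MVT::f64, Expand);
>>>>   // Route the expanded libcalls to libdevice instead of libm's pow().
>>>>   setLibcallName(RTLIB::POW_F32, "__nv_powf");
>>>>   setLibcallName(RTLIB::POW_F64, "__nv_pow");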
>>>>
>>>>
>>>>> On Sun, Feb 17, 2013 at 9:48 AM, Dmitry Mikushin <dmitry at kernelgen.org> wrote:
>>>>
>>>>> Hi Justin,
>>>>>
>>>>> I don't understand why, for instance, the X86 backend handles pow
>>>>> automatically, while NVPTX should be a PITA requiring the user to bring
>>>>> his own pow implementation. Even at a very general level, this limits
>>>>> users' interest in the LLVM NVPTX backend. Could you please elaborate
>>>>> on the rationale behind your point? Why, in your opinion, are the
>>>>> accuracy modes I suggested not sufficient?
>>>>>
>>>>> - D.
>>>>>
>>>>>
>>>>> 2013/2/17 Justin Holewinski <justin.holewinski at gmail.com>
>>>>>
>>>>>> I would be very hesitant to expose all math library functions as
>>>>>> intrinsics.  I believe linking with a target-specific math library is the
>>>>>> correct approach, as it decouples the back end from the needs of the source
>>>>>> program/language.  Users should be free to use any math library
>>>>>> implementation they choose.  Intrinsics are meant for functions that
>>>>>> compile down to specific ISA features, like fused multiply-add and
>>>>>> square root.
>>>>>>  On Feb 16, 2013 8:46 PM, "Dmitry Mikushin" <dmitry at kernelgen.org>
>>>>>> wrote:
>>>>>>
>>>>>>> Dear Yuan,
>>>>>>>
>>>>>>> Sorry for delay with reply,
>>>>>>>
>>>>>>> The answers to your questions depend on where the math library sits
>>>>>>> in the code generation pipeline. At KernelGen, we currently have a
>>>>>>> user-level CUDA math module, adapted from cicc internals [1]. It is
>>>>>>> intended to be linked with the user's LLVM IR module, right before
>>>>>>> proceeding with final optimization and the backend. For the last few
>>>>>>> months we have been using this method as a temporary workaround for
>>>>>>> the absence of many math functions, to keep up the pace of application
>>>>>>> testing in our compiler test suite. Supplying math this way is not
>>>>>>> portable and introduces many issues, for instance:
>>>>>>> 1) The frontend (DragonEgg, in our case) must be taught to emit real
>>>>>>> math function calls instead of the LLVM intrinsics NVPTX cannot
>>>>>>> handle.
>>>>>>> 2) However, not all intrinsics should be replaced by math calls
>>>>>>> directly; for example, there is no cdexp call, but it can be modeled
>>>>>>> with sincos (see the sketch after this list).
>>>>>>> 3) Our math module assumes sm_20, and could be inefficient or
>>>>>>> non-portable on other families of GPUs.
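>>>>>>>
>>>>>>> A minimal sketch of point 2 (illustrative code only, not our actual
>>>>>>> module; it assumes the GNU/CUDA sincos extension, and on a GPU this
>>>>>>> would be a device function):
>>>>>>>
>>>>>>>   #define _GNU_SOURCE // expose sincos() on glibc
>>>>>>>   #include <math.h>
>>>>>>>
>>>>>>>   // cdexp(x + i*y) = e^x * (cos y + i*sin y)
>>>>>>>   void cdexp(double re, double im, double *out_re, double *out_im) {
>>>>>>>     double s, c;
>>>>>>>     sincos(im, &s, &c); // one fused call instead of sin() + cos()
>>>>>>>     double e = exp(re);
>>>>>>>     *out_re = e * c;
>>>>>>>     *out_im = e * s;
>>>>>>>   }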
>>>>>>>
>>>>>>> Instead of this approach, I think the math library should be
>>>>>>> implemented *as a lowering pass in the backend*, working directly with
>>>>>>> intrinsics. In that case naming does not matter, and final
>>>>>>> optimization is the backend's job anyway. But there is another
>>>>>>> important point: the backend should codegen math with respect to
>>>>>>> accuracy settings, specified either as backend options or as function
>>>>>>> attributes (a quite recent addition to LLVM). The accuracy settings
>>>>>>> should be:
>>>>>>> 1) fast-math (ftz, prec-div, prec-sqrt, fma, etc.)
>>>>>>> 2) whether to use GPU-specific low-precision functions (__sin, __cos,
>>>>>>> etc.)
>>>>>>>
>>>>>>> Following the latter approach, NVPTX math handling will conform to
>>>>>>> the rest of LLVM, and no host-dependent tweaks will be needed.
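>>>>>>>
>>>>>>> For example, such a lowering pass could key its choice off the
>>>>>>> per-function string attributes -- a minimal sketch (the attribute name
>>>>>>> is the one clang emits; the helper itself is hypothetical):
>>>>>>>
>>>>>>>   #include "llvm/IR/Attributes.h"
>>>>>>>   #include "llvm/IR/Function.h"
>>>>>>>   using namespace llvm;
>>>>>>>
>>>>>>>   // Decide whether low-precision __sin/__cos flavors are acceptable.
>>>>>>>   static bool allowFastMath(const Function &F) {
>>>>>>>     return F.getFnAttribute("unsafe-fp-math").getValueAsString() ==
>>>>>>>            "true";
>>>>>>>   }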
>>>>>>>
>>>>>>> I'm also interested in contributing to this development at a
>>>>>>> reasonable depth. Moving this part forward entirely on our own would
>>>>>>> slow down progress on our main targets too much; that's why I'm asking
>>>>>>> for your help and cooperation.
>>>>>>>
>>>>>>> Best regards,
>>>>>>> - Dima.
>>>>>>>
>>>>>>> [1]
>>>>>>> https://hpcforge.org/scm/viewvc.php/*checkout*/trunk/src/cuda/include/math.bc?root=kernelgen
>>>>>>>
>>>>>>> 2013/2/8 Yuan Lin <yulin at nvidia.com>
>>>>>>>
>>>>>>>> Yes, it helps a lot and we are working on it.
>>>>>>>>
>>>>>>>> A few questions:
>>>>>>>>
>>>>>>>> 1) What will be your use model of this library? Will you run
>>>>>>>> optimization phases after linking with the library? If so, what are
>>>>>>>> they?
>>>>>>>>
>>>>>>>> 2) Do you care if the names of functions differ from those in libm?
>>>>>>>> For example, it would be gpusin() instead of sin().
>>>>>>>>
>>>>>>>> 3) Do you need a different library for different host platforms? Why?
>>>>>>>>
>>>>>>>> 4) Any other functions (besides math) you want to see in this
>>>>>>>> library?
>>>>>>>>
>>>>>>>> Thanks.
>>>>>>>>
>>>>>>>> Yuan
>>>>>>>>
>>>>>>>> *From:* Dmitry Mikushin [mailto:dmitry at kernelgen.org]
>>>>>>>> *Sent:* Thursday, February 07, 2013 2:09 PM
>>>>>>>> *To:* Justin Holewinski; LLVM Developers Mailing List
>>>>>>>> *Cc:* Yuan Lin
>>>>>>>> *Subject:* [NVPTX] We need an LLVM CUDA math library, after all
>>>>>>>>
>>>>>>>> Hi Justin, gentlemen,
>>>>>>>>
>>>>>>>> I'm afraid I have to escalate this issue at this point. Since it was
>>>>>>>> first discussed last summer, it was sufficient for us for a while to
>>>>>>>> keep lowering of math calls into intrinsics disabled at the DragonEgg
>>>>>>>> level and to link them against CUDA math functions at the LLVM IR
>>>>>>>> level. Now I can say: this is no longer sufficient, and we need the
>>>>>>>> NVPTX backend to deal with GPU math.
>>>>>>>>
>>>>>>>> > There also is no standard libm for PTX.
>>>>>>>>
>>>>>>>> Yes, that's right, but there is an interesting idea: codegen the
>>>>>>>> CUDA math headers into LLVM IR and link the result with the user
>>>>>>>> module at the IR level. This method gives a perfect degree of
>>>>>>>> flexibility with respect to high-level languages: the user no longer
>>>>>>>> needs to deal with headers and can have math right in the IR,
>>>>>>>> regardless of the language it was lowered from. I can confirm this
>>>>>>>> method works very well for us with C and Fortran, but in order to
>>>>>>>> make accurate replacements of unsupported intrinsic calls, it needs
>>>>>>>> to become aware of the NVPTX backend's capabilities in the form of:
>>>>>>>>
>>>>>>>>   bool NVPTXTargetMachine::isIntrinsicSupported(Function &intrinsic)
>>>>>>>>
>>>>>>>> and
>>>>>>>>
>>>>>>>>   string NVPTXTargetMachine::whichMathCallReplacesIntrinsic(
>>>>>>>>       Function &intrinsic)
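>>>>>>>>
>>>>>>>> With those two (proposed) hooks, our IR-level math linker would boil
>>>>>>>> down to something like this sketch, where replaceWithCall is a
>>>>>>>> hypothetical helper on our side:
>>>>>>>>
>>>>>>>>   // M is the user Module, TM the NVPTXTargetMachine.
>>>>>>>>   for (Module::iterator F = M->begin(), E = M->end(); F != E; ++F)
>>>>>>>>     if (F->isIntrinsic() && !TM.isIntrinsicSupported(*F))
>>>>>>>>       replaceWithCall(*F, TM.whichMathCallReplacesIntrinsic(*F));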
>>>>>>>>
>>>>>>>> > I would prefer not to lower such things in the back-end since
>>>>>>>> > different compilers may want to implement such functions
>>>>>>>> > differently based on speed vs. accuracy trade-offs.
>>>>>>>>
>>>>>>>> Who are these different compilers? We are LLVM, the complete
>>>>>>>> compiler stack, which should handle these things according to its own
>>>>>>>> preferences. Derived compilers may certainly think differently, and
>>>>>>>> it's their own business to change anything they want and never
>>>>>>>> contribute back. We should not forget there are a lot of derived
>>>>>>>> projects that use LLVM directly, like KernelGen or the many embedded
>>>>>>>> DSLs that have recently started flourishing. Their completeness and
>>>>>>>> future rely on LLVM. For these reasons, I would strongly prefer that
>>>>>>>> LLVM/NVPTX supply a reference GPU math implementation, and I invite
>>>>>>>> you and everyone else to form a joint roadmap to deliver it.
>>>>>>>>
>>>>>>>> Before we start: IANAL, but something tells me there could be a
>>>>>>>> licensing issue with releasing LLVM IR emitted from the CUDA headers.
>>>>>>>> Could you please check this with NVIDIA?
>>>>>>>>
>>>>>>>> Many thanks,
>>>>>>>> - D.
>>>>>>>>
>>>>>>>> 2012/9/6 Justin Holewinski <justin.holewinski at gmail.com>:
>>>>>>>> > On 09/06/2012 10:02 AM, Dmitry N. Mikushin wrote:
>>>>>>>> >>
>>>>>>>> >> Dear all,
>>>>>>>> >>
>>>>>>>> >> During app compilation we have a crash in NVPTX backend:
>>>>>>>> >>
>>>>>>>> >>   LLVM ERROR: Cannot select: 0x732b270: i64 =
>>>>>>>> >>     ExternalSymbol'__powisf2' [ID=18]
>>>>>>>> >>
>>>>>>>> >> As I understand it, LLVM tries to lower the following call
>>>>>>>> >>
>>>>>>>> >>   %28 = call ptx_device float @llvm.powi.f32(float 2.000000e+00,
>>>>>>>> >>         i32 %8) nounwind readonly
>>>>>>>> >>
>>>>>>>> >> to a device intrinsic. The table llvm/IntrinsicsNVVM.td does not
>>>>>>>> >> contain such an intrinsic; however, it should be a builtin,
>>>>>>>> >> according to cuda/include/math_functions.h
>>>>>>>> >
>>>>>>>> >
>>>>>>>> > It actually gets lowered into an external function call.
>>>>>>>> >
>>>>>>>> >
>>>>>>>> >>
>>>>>>>> >> Is my understanding correct, and do we simply need to add the
>>>>>>>> >> corresponding definition to llvm/IntrinsicsNVVM.td? How do we do
>>>>>>>> >> that, and what are the rules?
>>>>>>>> >
>>>>>>>> >
>>>>>>>> > PTX does not have an instruction (or simple series of
>>>>>>>> > instructions) that implements pow, so this will not be handled.  I
>>>>>>>> > would prefer not to lower such things in the back-end since
>>>>>>>> > different compilers may want to implement such functions
>>>>>>>> > differently based on speed vs. accuracy trade-offs.
>>>>>>>> >
>>>>>>>> > There also is no standard libm for PTX.  It is up to the
>>>>>>>> > higher-level compiler to link against a run-time library that
>>>>>>>> > provides functions like pow (see include/math_functions.h in a
>>>>>>>> > CUDA distribution).
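>>>>>>>>
>>>>>>>> [For reference, here is roughly what such a run-time library has to
>>>>>>>> provide for the powi case above -- a sketch of the __powisf2 routine
>>>>>>>> the legalizer calls, using square-and-multiply; compiler-rt ships an
>>>>>>>> equivalent for CPU targets:]
>>>>>>>>
>>>>>>>>   extern "C" float __powisf2(float a, int b) {
>>>>>>>>     float r = 1.0f;
>>>>>>>>     unsigned n = (b < 0) ? -(unsigned)b : (unsigned)b;
>>>>>>>>     for (; n; n >>= 1) { // walk the exponent bits
>>>>>>>>       if (n & 1) r *= a;
>>>>>>>>       a *= a;
>>>>>>>>     }
>>>>>>>>     return b < 0 ? 1.0f / r : r;
>>>>>>>>   }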
>>>>>>>> >
>>>>>>>> >>
>>>>>>>> >> Thanks,
>>>>>>>> >> - D.
>>>>>>>> >> _______________________________________________
>>>>>>>> >> LLVM Developers mailing list
>>>>>>>> >> LLVMdev at cs.uiuc.edu         http://llvm.cs.uiuc.edu
>>>>>>>> >> http://lists.cs.uiuc.edu/mailman/listinfo/llvmdev
>>>>>>>> >
>>>>>>>> > --
>>>>>>>> > Thanks,
>>>>>>>> >
>>>>>>>> > Justin Holewinski
>>>>>>>>
>>>>>>>>
>>>>>>>
>>>>>
>>>>
>>>>
>>>> --
>>>>
>>>> Thanks,
>>>>
>>>> Justin Holewinski
>>>>
>>>
>>>
>>
>
>
> --
>
> Thanks,
>
> Justin Holewinski
>