[LLVMdev] [NVPTX] We need an LLVM CUDA math library, after all

Wed Jun 5 05:36:38 PDT 2013

Thanks for the info!  I would be glad to hear of any issues you have
encountered on this path.  I tried to make sure the 3.3 release was fully
compatible with the libdevice implementation shipping with 5.5 (and as far
as I know, it is).  It's just not an officially supported configuration.

Also, I've been meaning to address your -drvcuda issue.  How would you feel
about making that a part of the triple?

On Wed, Jun 5, 2013 at 5:10 AM, Dmitry Mikushin <dmitry at kernelgen.org> wrote:

> Dear all,
>
> FWIW, I've tested libdevice.compute_20.10.bc and
> libdevice.compute_30.10.bc from /cuda/nvvm/libdevice shipped with the CUDA 5.5
> preview. The IR is compatible with the LLVM 3.4 trunk that we use. Results are
> correct, and performance is almost the same as what we had before with
> cicc-sniffed IR, or maybe <10% better. I will test libdevice.compute_35.10.bc
> once we get K20 support.
>
> Thanks for addressing this,
> - D.
>
>
> 2013/2/17 Dmitry Mikushin <dmitry at kernelgen.org>
>
>> > The issue is really that there is no standard math library for PTX.
>>
>> Well, formally, that could very well be true. Moreover, some parts of the CPU
>> math standard are impossible to implement on parallel architectures;
>> consider, for example, errno behavior. But here we are speaking more about the
>> practical side. And the practical side is: for the past 5 years CUDA has
>> claimed to accelerate compute applications, and that implies having good math
>> support. For clarity, we can drop the term "LLVM CUDA math library" and
>> instead speak of the need for all of LLVM to have "the same degree of math
>> support" CUDA currently has for C/C++.
>>
>> If you think having the math module outside of the backend is more feasible,
>> this is also a way to go, but please see what we need in this case in the first
>> email: the NVPTX backend will have to tell us which intrinsics it is
>> going to lower and which ones will make it crash. So there is a need to
>> modify something in the backend anyway.
>>
>>
>> - D.
>>
>> 2013/2/17 Justin Holewinski <justin.holewinski at gmail.com>
>>
>>> The X86 back-end just calls into libm:
>>>
>>>   // Always use a library call for pow.
>>>   setOperationAction(ISD::FPOW             , MVT::f32  , Expand);
>>>   setOperationAction(ISD::FPOW             , MVT::f64  , Expand);
>>>   setOperationAction(ISD::FPOW             , MVT::f80  , Expand);
>>>
>>>
>>> The issue is really that there is no standard math library for PTX.  I
>>> agree that this is a pain for most users, but I don't think the right
>>> solution is to embed a whole suite of math functions into the back-end.
>>> All I'm suggesting is that we instead follow the path of linking in an
>>> external math library of target-specific functions.  Whether you link your
>>> IR with a bitcode library before codegen or have codegen emit library
>>> function calls is an implementation detail, with each having advantages.
>>> The accuracy modes can be used to pick the proper library function in the
>>> latter case, but I still think library function choice is better left up to
>>> the front-end, and the accuracy attributes are a better fit to drive
>>> optimization.
>>>
>>>
>>> On Sun, Feb 17, 2013 at 9:48 AM, Dmitry Mikushin <dmitry at kernelgen.org> wrote:
>>>
>>>> Hi Justin,
>>>>
>>>> I don't understand why, for instance, the X86 backend handles pow
>>>> automatically, while NVPTX should be a PITA requiring the user to bring his
>>>> own pow implementation. Even at a very general level, this limits users'
>>>> interest in the LLVM NVPTX backend. Could you please elaborate on the
>>>> rationale behind your point? Why are the accuracy modes I suggested not
>>>> sufficient, in your opinion?
>>>>
>>>> - D.
>>>>
>>>>
>>>> 2013/2/17 Justin Holewinski <justin.holewinski at gmail.com>
>>>>
>>>>> I would be very hesitant to expose all math library functions as
>>>>> intrinsics.  I believe linking with a target-specific math library is the
>>>>> correct approach, as it decouples the back end from the needs of the source
>>>>> program/language.  Users should be free to use any math library
>>>>> implementation they choose.  Intrinsics are meant for functions that
>>>>> compile down to specific ISA features, like fused multiply-add and square
>>>>> root.
>>>>>
>>>>> On Feb 16, 2013 8:46 PM, "Dmitry Mikushin" <dmitry at kernelgen.org>
>>>>> wrote:
>>>>>
>>>>>> Dear Yuan,
>>>>>>
>>>>>> Sorry for the delay in replying.
>>>>>>
>>>>>> The answers to your questions depend on where the math
>>>>>> library sits in the code generation pipeline. At KernelGen, we
>>>>>> currently have a user-level CUDA math module, adopted from cicc internals
>>>>>> [1]. It is intended to be linked with the user LLVM IR module, right before
>>>>>> proceeding with the final optimization and backend. For the last few months
>>>>>> we have been using this method as a temporary workaround for the absence of
>>>>>> many math functions, to keep up the pace of application testing in our
>>>>>> compiler test suite. Supplying math in this way is not portable and
>>>>>> introduces many issues, for instance:
>>>>>> 1) The frontend (DragonEgg, in our case) must be taught to emit real
>>>>>> math function calls instead of the LLVM intrinsics that NVPTX cannot handle.
>>>>>> 2) However, not all intrinsics can be replaced by math calls
>>>>>> directly; for example, there is no cdexp call, but it could be modelled
>>>>>> with sincos.
>>>>>> 3) Our math module assumes sm_20, and could be inefficient or
>>>>>> non-portable on other families of GPUs.
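The cdexp case in item 2 can be sketched in a few lines, assuming the identity exp(a + ib) = exp(a) * (cos(b) + i*sin(b)). The helper name cdexp_via_sincos is hypothetical, and std::sin/std::cos stand in for a single device-side sincos call, which standard C++ does not provide. This is only an illustration, not the KernelGen implementation.

```cpp
#include <cmath>
#include <complex>

// Hypothetical helper: complex exponential modelled with real exp, sin and cos.
// On a GPU the sin/cos pair would typically be one sincos call; std::sin and
// std::cos are used here as stand-ins (an assumption of this sketch).
std::complex<double> cdexp_via_sincos(const std::complex<double>& z) {
    const double scale = std::exp(z.real());  // exp(a)
    const double s = std::sin(z.imag());      // sin(b)
    const double c = std::cos(z.imag());      // cos(b)
    return std::complex<double>(scale * c, scale * s);
}
```

The result should agree with std::exp on a std::complex<double> argument up to rounding.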
>>>>>>
>>>>>> Instead of this approach, I think the math library should be implemented
>>>>>> *as a lowering pass in the backend*, working directly with intrinsics.
>>>>>> In this case naming is not important, and final optimization is
>>>>>> the job of the backend. But there is another important thing: the backend
>>>>>> should codegen math with respect to accuracy settings, specified either as
>>>>>> backend options or as function attributes (quite a recent addition to LLVM).
>>>>>> Accuracy settings should be:
>>>>>> 1) fast-math (ftz, prec-div, prec-sqrt, fma, etc.)
>>>>>> 2) Whether or not to use GPU-specific low-precision functions (__sin, __cos,
>>>>>> etc.)
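A minimal sketch of how the two groups of accuracy settings might drive function selection. The struct, its field names, its default values, and selectSine are all hypothetical; "sinf" and "__sinf" are the CUDA single-precision sine and its fast approximate counterpart, used here only as example names.

```cpp
#include <string>

// Hypothetical bundle of the accuracy settings listed above.
// The defaults favor accuracy; that choice is an assumption of this sketch.
struct AccuracySettings {
    bool ftz;                // flush denormals to zero
    bool precDiv;            // precise division
    bool precSqrt;           // precise square root
    bool fma;                // allow fused multiply-add contraction
    bool useFastIntrinsics;  // allow GPU-specific low-precision functions
    AccuracySettings()
        : ftz(false), precDiv(true), precSqrt(true), fma(true),
          useFastIntrinsics(false) {}
};

// Pick the single-precision sine implementation name for the given settings.
std::string selectSine(const AccuracySettings& s) {
    return s.useFastIntrinsics ? "__sinf" : "sinf";
}
```

The same selection could live in a front end or in a backend lowering pass; the sketch takes no position on which.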
>>>>>>
>>>>>> With the latter approach, math handling in NVPTX will conform to the
>>>>>> rest of LLVM, and no host-dependent tweaks will be needed.
>>>>>>
>>>>>> I'm also interested in contributing to this development at
>>>>>> reasonable depth. Doing this part entirely on our own would slow down
>>>>>> progress on our main targets too much; that's why I'm asking for your help
>>>>>> and cooperation.
>>>>>>
>>>>>> Best regards,
>>>>>> - Dima.
>>>>>>
>>>>>> [1]
>>>>>> https://hpcforge.org/scm/viewvc.php/*checkout*/trunk/src/cuda/include/math.bc?root=kernelgen
>>>>>>
>>>>>> 2013/2/8 Yuan Lin <yulin at nvidia.com>
>>>>>>
>>>>>>> Yes, it helps a lot and we are working on it.
>>>>>>>
>>>>>>> A few questions,
>>>>>>>
>>>>>>> 1) What will be your use model of this library? Will you
>>>>>>> run optimization phases after linking with the library? If so, what are
>>>>>>> they?
>>>>>>>
>>>>>>> 2) Do you care if the names of functions differ from those
>>>>>>> in libm? For example, it would be gpusin() instead of sin().
>>>>>>>
>>>>>>> 3) Do you need a different library for different host
>>>>>>> platforms? Why?
>>>>>>>
>>>>>>> 4) Any other functions (besides math) you want to see in
>>>>>>> this library?
>>>>>>>
>>>>>>> Thanks.
>>>>>>>
>>>>>>> Yuan
>>>>>>>
>>>>>>> *From:* Dmitry Mikushin [mailto:dmitry at kernelgen.org]
>>>>>>> *Sent:* Thursday, February 07, 2013 2:09 PM
>>>>>>> *To:* Justin Holewinski; LLVM Developers Mailing List
>>>>>>> *Cc:* Yuan Lin
>>>>>>> *Subject:* [NVPTX] We need an LLVM CUDA math library, after all
>>>>>>>
>>>>>>> Hi Justin, gentlemen,
>>>>>>>
>>>>>>> I'm afraid I have to escalate this issue at this point. Since it was
>>>>>>> first discussed last summer, it was sufficient for us for a
>>>>>>> while to disable the lowering of math calls into intrinsics at the
>>>>>>> DragonEgg level, and link those calls against CUDA math functions at the
>>>>>>> LLVM IR level. Now I can say: this is no longer sufficient, and we need
>>>>>>> the NVPTX backend to deal with GPU math.
>>>>>>>
>>>>>>> > There also is no standard libm for PTX.
>>>>>>>
>>>>>>> Yes, that's right, but there is an interesting idea: codegen the CUDA
>>>>>>> math headers into LLVM IR and link the result with the user module at the
>>>>>>> IR level. This method gives great flexibility with respect to high-level
>>>>>>> languages: the user no longer needs to deal with headers and can have math
>>>>>>> right in the IR, regardless of the language it was lowered from. I can
>>>>>>> confirm this method works very well for us with C and Fortran, but in order
>>>>>>> to make accurate replacements of unsupported intrinsic calls, it needs to
>>>>>>> become aware of the NVPTX backend's capabilities in the form of:
>>>>>>>
>>>>>>> bool NVPTXTargetMachine::isIntrinsicSupported(Function& intrinsic) and
>>>>>>> string NVPTXTargetMachine::whichMathCallReplacesIntrinsic(Function&
>>>>>>> intrinsic)
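The shape of such a capability query can be sketched as a standalone lookup table. This is not LLVM code: MathCapabilityTable and makeExampleTable are hypothetical, intrinsics are keyed by name rather than by Function&, and the table contents are illustrative assumptions, not a statement of what NVPTX actually lowers.

```cpp
#include <string>
#include <unordered_map>
#include <unordered_set>

// Hypothetical stand-in for the two proposed queries.
struct MathCapabilityTable {
    std::unordered_set<std::string> native;                    // lowered directly
    std::unordered_map<std::string, std::string> replacement;  // intrinsic -> math call

    bool isIntrinsicSupported(const std::string& name) const {
        return native.count(name) != 0;
    }

    // Returns the replacing math call, or an empty string if none is known.
    std::string whichMathCallReplacesIntrinsic(const std::string& name) const {
        auto it = replacement.find(name);
        return it == replacement.end() ? std::string() : it->second;
    }
};

// Illustrative contents only (an assumption of this sketch).
MathCapabilityTable makeExampleTable() {
    MathCapabilityTable t;
    t.native = {"llvm.sqrt.f32", "llvm.fma.f32"};
    t.replacement = {{"llvm.pow.f32", "powf"},
                     {"llvm.pow.f64", "pow"},
                     {"llvm.sin.f32", "sinf"}};
    return t;
}
```

A front end could consult both queries before codegen: keep intrinsics the backend reports as supported, and rewrite the rest to the named math call from the linked IR module.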
>>>>>>>
>>>>>>> > I would prefer not to lower such things in the back-end since
>>>>>>> different compilers may want to implement such functions differently based
>>>>>>> on speed vs. accuracy trade-offs.
>>>>>>>
>>>>>>> Who are those different compilers? We are LLVM, the complete
>>>>>>> compiler stack, which should handle these things according to its own
>>>>>>> preference. Derived compilers may certainly think differently, and it's
>>>>>>> their own business to change anything they want and never contribute back.
>>>>>>> We should not forget there are a lot of derived projects that use LLVM
>>>>>>> directly, like KernelGen or many of the embedded DSLs that have recently
>>>>>>> started flourishing. Their completeness and future rely on LLVM. For these
>>>>>>> reasons, I would strongly prefer that LLVM/NVPTX supply a reference GPU
>>>>>>> math implementation, and I invite you and everyone else to form a joint
>>>>>>> roadmap to deliver it.
>>>>>>>
>>>>>>> Before we start: IANAL, but something tells me there could be a
>>>>>>> licensing issue with releasing the LLVM IR emitted from CUDA headers.
>>>>>>> Could you please check this with NVIDIA?
>>>>>>>
>>>>>>> Many thanks,
>>>>>>> - D.
>>>>>>>
>>>>>>> 2012/9/6 Justin Holewinski <justin.holewinski at gmail.com>:
>>>>>>> > On 09/06/2012 10:02 AM, Dmitry N. Mikushin wrote:
>>>>>>> >>
>>>>>>> >> Dear all,
>>>>>>> >>
>>>>>>> >> During app compilation we have a crash in NVPTX backend:
>>>>>>> >>
>>>>>>> >> LLVM ERROR: Cannot select: 0x732b270: i64 = ExternalSymbol'__powisf2'
>>>>>>> >> [ID=18]
>>>>>>> >>
>>>>>>> >> As I understand LLVM tries to lower the following call
>>>>>>> >>
>>>>>>> >> %28 = call ptx_device float @llvm.powi.f32(float 2.000000e+00, i32 %8)
>>>>>>> >> nounwind readonly
>>>>>>> >>
>>>>>>> >> to device intrinsic. The table llvm/IntrinsicsNVVM.td does not contain
>>>>>>> >> such intrinsic, however it should be builtin, according to
>>>>>>> >> cuda/include/math_functions.h
>>>>>>> >
>>>>>>> >
>>>>>>> > It actually gets lowered into an external function call.
>>>>>>> >
>>>>>>> >
>>>>>>> >>
>>>>>>> >> Is my understanding correct, and we need simply add the corresponding
>>>>>>> >> definition to llvm/IntrinsicsNVVM.td ? How to do that, what are the
>>>>>>> >> rules?
>>>>>>> >
>>>>>>> >
>>>>>>> > PTX does not have an instruction (or simple series of instructions) that
>>>>>>> > implements pow, so this will not be handled.  I would prefer not to lower
>>>>>>> > such things in the back-end since different compilers may want to implement
>>>>>>> > such functions differently based on speed vs. accuracy trade-offs.
>>>>>>> >
>>>>>>> > There also is no standard libm for PTX.  It is up to the higher-level
>>>>>>> > compiler to link against a run-time library that provides functions like pow
>>>>>>> > (see include/math_functions.h in a CUDA distribution).
>>>>>>> >
>>>>>>> >>
>>>>>>> >> Thanks,
>>>>>>> >> - D.
>>>>>>> >
>>>>>>> > --
>>>>>>> > Thanks,
>>>>>>> >
>>>>>>> > Justin Holewinski
>>>>>>> >
>>>>>>>
>>>>>>>
>>>>>>
>>>>
>>>
>>>
>>> --
>>>
>>> Thanks,
>>>
>>> Justin Holewinski
>>>
>>
>>
>

-- 

Thanks,

Justin Holewinski