[LLVMdev] [NVPTX] We need an LLVM CUDA math library, after all

Dmitry Mikushin dmitry at kernelgen.org
Wed Jun 5 02:10:26 PDT 2013


Dear all,

FWIW, I've tested libdevice.compute_20.10.bc and libdevice.compute_30.10.bc
from /cuda/nvvm/libdevice shipped with the CUDA 5.5 preview. The IR is
compatible with the LLVM 3.4 trunk that we use. Results are correct, and
performance is almost the same as what we had before with cicc-sniffed IR,
or maybe <10% better. We will test libdevice.compute_35.10.bc once we get
K20 support.
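
For the record, here is roughly what "linking libdevice with the user
module before optimization and codegen" looks like through the LLVM C++
API (an illustrative sketch, assuming the current Linker interface; the
3.4-era signatures differed slightly):

  #include "llvm/IR/LLVMContext.h"
  #include "llvm/IR/Module.h"
  #include "llvm/IRReader/IRReader.h"
  #include "llvm/Linker/Linker.h"
  #include "llvm/Support/SourceMgr.h"

  // Link libdevice into UserM so its __nv_* math functions become
  // available for inlining and optimization before NVPTX codegen.
  bool linkLibdevice(llvm::Module &UserM, llvm::LLVMContext &Ctx) {
    llvm::SMDiagnostic Err;
    std::unique_ptr<llvm::Module> LibM =
        llvm::parseIRFile("libdevice.compute_20.10.bc", Err, Ctx);
    if (!LibM)
      return false;
    // linkModules returns true on error; unused libdevice functions
    // can be dropped afterwards by internalize + global DCE.
    return !llvm::Linker::linkModules(UserM, std::move(LibM));
  }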

Thanks for addressing this,
- D.

2013/2/17 Dmitry Mikushin <dmitry at kernelgen.org>

> > The issue is really that there is no standard math library for PTX.
>
> Well, formally, that may very well be true. Moreover, parts of the CPU
> math standard are impossible to implement on parallel architectures;
> consider, for example, errno behavior. But here we are speaking more about
> the practical side. And the practical side is: for the past 5 years CUDA
> has claimed to accelerate compute applications, and that implies good math
> support. For clarity, we can drop the term "LLVM CUDA math library" and
> instead speak of the need for LLVM as a whole to have "the same degree of
> math support" that CUDA currently has for C/C++.
>
> If you think having the math module outside of the backend is more
> feasible, that is also a way to go, but please see what we need in that
> case in the first email: either way, the NVPTX backend will have to tell
> us which intrinsics it is going to lower, and which ones will make it
> crash. So something in the backend needs to be modified anyway.
>
>
> - D.
>
> 2013/2/17 Justin Holewinski <justin.holewinski at gmail.com>
>
>> The X86 back-end just calls into libm:
>>
>>   // Always use a library call for pow.
>>   setOperationAction(ISD::FPOW             , MVT::f32  , Expand);
>>   setOperationAction(ISD::FPOW             , MVT::f64  , Expand);
>>   setOperationAction(ISD::FPOW             , MVT::f80  , Expand);
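>>
>> In a back-end that does want codegen to emit library calls, the same
>> hooks could point the pow libcall at a GPU math library. A minimal
>> sketch (hypothetical: __nv_powf is an assumed libdevice-style name, and
>> NVPTX does not do this today):
>>
>>   // Expand FPOW into a call to the RTLIB::POW_F32 libcall, and give
>>   // that libcall a name that a linked GPU math library must provide.
>>   setOperationAction(ISD::FPOW, MVT::f32, Expand);
>>   setLibcallName(RTLIB::POW_F32, "__nv_powf");  // assumed name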
>>
>>
>> The issue is really that there is no standard math library for PTX.  I
>> agree that this is a pain for most users, but I don't think the right
>> solution is to embed a whole suite of math functions into the back-end.
>>  All I'm suggesting is that we instead follow the path of linking in an
>> external math library of target-specific functions.  Whether you link your
>> IR with a bitcode library before codegen or have codegen emit library
>> function calls is an implementation detail, with each having advantages.
>>  The accuracy modes can be used to pick the proper library function in the
>> latter case, but I still think library function choice is better left up to
>> the front-end, and the accuracy attributes are a better fit to drive
>> optimization.
>>
>>
>>> On Sun, Feb 17, 2013 at 9:48 AM, Dmitry Mikushin <dmitry at kernelgen.org> wrote:
>>
>>> Hi Justin,
>>>
>>> I don't understand why, for instance, the X86 backend handles pow
>>> automatically, while NVPTX should be a PITA requiring the user to bring
>>> his own pow implementation. Even at a very general level, this limits
>>> users' interest in the LLVM NVPTX backend. Could you please elaborate on
>>> the rationale behind your point? Why are the accuracy modes I suggested
>>> not sufficient, in your opinion?
>>>
>>> - D.
>>>
>>>
>>> 2013/2/17 Justin Holewinski <justin.holewinski at gmail.com>
>>>
>>>> I would be very hesitant to expose all math library functions as
>>>> intrinsics.  I believe linking with a target-specific math library is
>>>> the correct approach, as it decouples the back end from the needs of the
>>>> source program/language.  Users should be free to use any math library
>>>> implementation they choose.  Intrinsics are meant for functions that
>>>> compile down to specific ISA features, like fused multiply-add and
>>>> square root.
>>>>  On Feb 16, 2013 8:46 PM, "Dmitry Mikushin" <dmitry at kernelgen.org>
>>>> wrote:
>>>>
>>>>> Dear Yuan,
>>>>>
>>>>> Sorry for the delay in replying,
>>>>>
>>>>> Answers to your questions could differ depending on where the math
>>>>> library is placed in the code generation pipeline. At KernelGen, we
>>>>> currently have a user-level CUDA math module, adopted from cicc
>>>>> internals [1]. It is intended to be linked with the user's LLVM IR
>>>>> module, right before proceeding with final optimization and the
>>>>> backend. For the last few months we have been using this method to
>>>>> temporarily work around the absence of many math functions, to keep up
>>>>> the speed of application testing in our compiler test suite. Supplying
>>>>> math in such a way is not portable and introduces many issues, for
>>>>> instance:
>>>>> 1) The frontend (DragonEgg, in our case) must be taught to emit real
>>>>> math function calls instead of those LLVM intrinsics NVPTX cannot
>>>>> handle.
>>>>> 2) However, not all intrinsics should be replaced by math calls
>>>>> directly; for example, there is no cdexp call, but it can be modelled
>>>>> with sincos (see the sketch after this list).
>>>>> 3) Our math module assumes sm_20, and could be inefficient or
>>>>> non-portable on other families of GPUs.
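>>>>>
>>>>> For item 2, a minimal sketch of what "modelled with sincos" means
>>>>> (an editor's illustration; on the device the sincos symbol would be
>>>>> something like libdevice's __nv_sincos, which is an assumed mapping):
>>>>>
>>>>>   #define _GNU_SOURCE
>>>>>   #include <math.h>
>>>>>
>>>>>   // cdexp(x + i*y) = exp(x) * (cos(y) + i*sin(y)), so one sincos
>>>>>   // plus one exp covers the complex exponential.
>>>>>   void cdexp(double re, double im, double *out_re, double *out_im) {
>>>>>     double s, c;
>>>>>     sincos(im, &s, &c);  // GNU extension on the host
>>>>>     double e = exp(re);
>>>>>     *out_re = e * c;
>>>>>     *out_im = e * s;
>>>>>   }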
>>>>>
>>>>> Instead of this approach, I think the math library should be
>>>>> implemented *as a lowering pass in the backend*, working directly with
>>>>> intrinsics. In this case naming is not important, and final
>>>>> optimization is the backend's job. But there is another important
>>>>> thing: the backend should codegen math with respect to accuracy
>>>>> settings, specified either as backend options or as function
>>>>> attributes (a quite recent addition to LLVM). Accuracy settings should
>>>>> be (a sketch of checking them follows this list):
>>>>> 1) fast-math (ftz, prec-div, prec-sqrt, fma, etc.)
>>>>> 2) whether to use GPU-specific low-precision functions (__sin, __cos, etc.)
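>>>>>
>>>>> As a minimal sketch, such a lowering pass could consult the
>>>>> per-function attributes like this (the "unsafe-fp-math" string
>>>>> attribute exists in LLVM; the pass around it is hypothetical):
>>>>>
>>>>>   #include "llvm/IR/Function.h"
>>>>>
>>>>>   // Decide between precise and fast expansions of a math intrinsic
>>>>>   // based on the calling function's fast-math attribute.
>>>>>   bool allowFastMath(const llvm::Function &F) {
>>>>>     return F.getFnAttribute("unsafe-fp-math").getValueAsString() ==
>>>>>            "true";
>>>>>   }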
>>>>>
>>>>> Following the latter approach, math handling in NVPTX will conform to
>>>>> the rest of LLVM, and no host-dependent tweaks will be needed.
>>>>>
>>>>> I'm also interested in contributing to these developments at a
>>>>> reasonable depth. Moving this part forward on our own would slow down
>>>>> progress on our main targets too much; that's why I'm asking for your
>>>>> help and cooperation.
>>>>>
>>>>> Best regards,
>>>>> - Dima.
>>>>>
>>>>> [1]
>>>>> https://hpcforge.org/scm/viewvc.php/*checkout*/trunk/src/cuda/include/math.bc?root=kernelgen
>>>>>
>>>>> 2013/2/8 Yuan Lin <yulin at nvidia.com>
>>>>>
>>>>>> Yes, it helps a lot and we are working on it.
>>>>>>
>>>>>> A few questions:
>>>>>>
>>>>>> 1) What will be your use model of this library? Will you run
>>>>>> optimization phases after linking with the library? If so, what are
>>>>>> they?
>>>>>> 2) Do you care if the names of functions differ from those in libm?
>>>>>> For example, it would be gpusin() instead of sin().
>>>>>> 3) Do you need a different library for different host platforms? Why?
>>>>>> 4) Any other functions (besides math) you want to see in this library?
>>>>>>
>>>>>> Thanks.
>>>>>>
>>>>>> Yuan
>>>>>>
>>>>>> From: Dmitry Mikushin [mailto:dmitry at kernelgen.org]
>>>>>> Sent: Thursday, February 07, 2013 2:09 PM
>>>>>> To: Justin Holewinski; LLVM Developers Mailing List
>>>>>> Cc: Yuan Lin
>>>>>> Subject: [NVPTX] We need an LLVM CUDA math library, after all
>>>>>>
>>>>>> Hi Justin, gentlemen,
>>>>>>
>>>>>> I'm afraid I have to escalate this issue at this point. Since it was
>>>>>> first discussed last summer, it was sufficient for us for a while to
>>>>>> have lowering of math calls into intrinsics disabled at the DragonEgg
>>>>>> level, and to link them against CUDA math functions at the LLVM IR
>>>>>> level. Now I can say: this is no longer sufficient, and we need the
>>>>>> NVPTX backend to deal with GPU math.
>>>>>>
>>>>>> > There also is no standard libm for PTX.
>>>>>>
>>>>>> Yes, that's right, but there is an interesting idea: codegen the CUDA
>>>>>> math headers into LLVM IR and link it with the user module at the IR
>>>>>> level. This method gives a perfect degree of flexibility with respect
>>>>>> to high-level languages: the user no longer needs to deal with headers
>>>>>> and can have math right in the IR, regardless of the language it was
>>>>>> lowered from. I can confirm this method works very well for us with C
>>>>>> and Fortran, but in order to make accurate replacements of unsupported
>>>>>> intrinsic calls, it needs to become aware of the NVPTX backend's
>>>>>> capabilities in the form of:
>>>>>>
>>>>>>   bool NVPTXTargetMachine::isIntrinsicSupported(Function& intrinsic)
>>>>>>   string NVPTXTargetMachine::whichMathCallReplacesIntrinsic(Function& intrinsic)
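>>>>>>
>>>>>> A minimal sketch of how a front-end could consume such a query API
>>>>>> (hypothetical: these two members are only proposed above and do not
>>>>>> exist in NVPTXTargetMachine):
>>>>>>
>>>>>>   // For each intrinsic call emitted by the front-end, ask the
>>>>>>   // target whether it can lower it; if not, rewrite the call site
>>>>>>   // to the named function from the linked CUDA math IR module.
>>>>>>   if (!TM.isIntrinsicSupported(*IntrinsicFn)) {
>>>>>>     std::string Callee = TM.whichMathCallReplacesIntrinsic(*IntrinsicFn);
>>>>>>     // ... replace the call to IntrinsicFn with a call to Callee ...
>>>>>>   }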
>>>>>>
>>>>>> > I would prefer not to lower such things in the back-end since
>>>>>> different compilers may want to implement such functions differently based
>>>>>> on speed vs. accuracy trade-offs.
>>>>>>
>>>>>> Who are those different compilers? We are LLVM, the complete compiler
>>>>>> stack, which should handle these things according to its own
>>>>>> preference. Derived compilers may certainly think differently, and
>>>>>> it's their own business to change anything they want and never
>>>>>> contribute back. We should not forget there are a lot of derived
>>>>>> projects that use LLVM directly, like KernelGen or the many embedded
>>>>>> DSLs that have recently started flourishing. Their completeness and
>>>>>> future rely on LLVM. For these reasons, I would strongly prefer that
>>>>>> LLVM/NVPTX supply a reference GPU math implementation, and I invite
>>>>>> you and everyone else to form a joint roadmap to deliver it.
>>>>>>
>>>>>> Before we start: IANAL, but something tells me there could be a
>>>>>> licensing issue with releasing the LLVM IR emitted from the CUDA
>>>>>> headers. Could you please check this with NVIDIA?
>>>>>>
>>>>>> Many thanks,
>>>>>> - D.
>>>>>>
>>>>>> 2012/9/6 Justin Holewinski <justin.holewinski at gmail.com>:
>>>>>> > On 09/06/2012 10:02 AM, Dmitry N. Mikushin wrote:
>>>>>> >>
>>>>>> >> Dear all,
>>>>>> >>
>>>>>> >> During app compilation we get a crash in the NVPTX backend:
>>>>>> >>
>>>>>> >> LLVM ERROR: Cannot select: 0x732b270: i64 =
>>>>>> >> ExternalSymbol'__powisf2' [ID=18]
>>>>>> >>
>>>>>> >> As I understand it, LLVM tries to lower the following call
>>>>>> >>
>>>>>> >> %28 = call ptx_device float @llvm.powi.f32(float 2.000000e+00,
>>>>>> >> i32 %8) nounwind readonly
>>>>>> >>
>>>>>> >> to a device intrinsic. The table llvm/IntrinsicsNVVM.td does not
>>>>>> >> contain such an intrinsic; however, it should be a builtin,
>>>>>> >> according to cuda/include/math_functions.h
>>>>>> >
>>>>>> >
>>>>>> > It actually gets lowered into an external function call.
>>>>>> >
>>>>>> >
>>>>>> >>
>>>>>> >> Is my understanding correct, and do we simply need to add the
>>>>>> >> corresponding definition to llvm/IntrinsicsNVVM.td? How do we do
>>>>>> >> that, and what are the rules?
>>>>>> >
>>>>>> >
>>>>>> > PTX does not have an instruction (or simple series of instructions)
>>>>>> that
>>>>>> > implements pow, so this will not be handled.  I would prefer not to
>>>>>> lower
>>>>>> > such things in the back-end since different compilers may want to
>>>>>> implement
>>>>>> > such functions differently based on speed vs. accuracy trade-offs.
>>>>>> >
>>>>>> > There also is no standard libm for PTX.  It is up to the
>>>>>> higher-level
>>>>>> > compiler to link against a run-time library that provides functions
>>>>>> like pow
>>>>>> > (see include/math_functions.h in a CUDA distribution).
>>>>>> >
>>>>>> >>
>>>>>> >> Thanks,
>>>>>> >> - D.
>>>>>> >> _______________________________________________
>>>>>> >> LLVM Developers mailing list
>>>>>> >> LLVMdev at cs.uiuc.edu         http://llvm.cs.uiuc.edu
>>>>>> >> http://lists.cs.uiuc.edu/mailman/listinfo/llvmdev
>>>>>> >
>>>>>> > --
>>>>>> > Thanks,
>>>>>> >
>>>>>> > Justin Holewinski
>>>>>>
>>>>>>
>>>>>
>>>
>>
>>
>> --
>>
>> Thanks,
>>
>> Justin Holewinski
>>
>
>

