[Libclc-dev] Add initial support for half precision builtins (take 3)

Thu May 17 13:57:00 PDT 2018

On Thu, May 17, 2018 at 2:47 PM, Jan Vesely <jan.vesely at rutgers.edu> wrote:
> On Thu, 2018-05-17 at 11:42 -0500, Aaron Watry via Libclc-dev wrote:
>> On Wed, May 16, 2018 at 3:53 PM, Jan Vesely via Libclc-dev
>> <libclc-dev at lists.llvm.org> wrote:
>> > On Wed, 2018-05-16 at 21:20 +0200, Jeroen Ketema via Libclc-dev wrote:
>> > > > On 16 May 2018, at 00:59, Jan Vesely <jan.vesely at rutgers.edu> wrote:
>> > > >
>> > > > Hi,
>> > > >
>> > > > thank you both, for fast feedback. I've attached v2 that should address
>> > > > most of the comments. I left larger whitespace cleanup for another day.
>> > > >
>> > > >
>> > > > On Tue, 2018-05-15 at 22:46 +0200, Jeroen Ketema via Libclc-dev wrote:
>> > > > > Hi Jan,
>> > > > >
>> > > > > > On 15 May 2018, at 05:09, Jan Vesely via Libclc-dev <libclc-dev at lists.llvm.org> wrote:
>> > > > > >
>> > > > > > Hi,
>> > > > > >
>> > > > > > I've tried to post this patch but it's too big for the ML (and I don't
>> > > > > > know who to bug to get it released).
>> > > > > >
>> > > > > > It adds basic fp16 builtins. The rest will probably need a fp32 wrapper
>> > > > > > (especially the sw implementations of math functions), but that needs
>> > > > > > fixed conversion routines.
>> > > > >
>> > > > > Is that going to fly? Wouldn’t you run into double rounding issues?
>> > > >
>> > > > It probably will, but it's the fastest solution. The alternative is to
>> > > > adapt all sw routines to half precision (including tables). At least
>> > > > GCN llvm backend does this for some ops anyway.
>> > > >
>> > > > Fixing conversion routines is higher priority. We can revisit this
>> > > > after.
>> > >
>> > > Yeah, if the aim is just to get something working, then just wrapping
>> > > the fp32 code is fine.
>> > >
>> > > One further remark: I think generic/lib/math/modf.inc needs to #undef ZERO.
>> >
>> > Thanks. I've fixed this locally. Guess since it was the same definition
>> > the compiler didn't complain.
>> >
>> > >
>> > > One question about generic/lib/math/clc_sqrt_impl.inc:
>> > >
>> > > +#define __CLC_NAN (half)NAN
>> > >
>> > > I assume the reason for this is that there’s __builtin_nanh(“”), or something
>> > > like that? (The ‘h’ in there is coming from my imagination.)
>> >
>> > yes. there's no __builtin_nanh(), this relies on the conversion
>> > preserving NaN.
>> >
>> > >
>> > > Otherwise, I don’t have any further comments.
>> >
>> > thanks,
>> > Jan
>>
>> One more thing.
>>
>> I was running through the "quick" CTS test profile yesterday, just to
>> see where we're at, and I noticed the following failures
>> from conformance_Test_half:
>>
>> Testing vstorea_half
>> 268435452) Failure at 0x1.fffff8p-96: *0x0000 vs 0xdead  vector_size =
>> 3 address_space = global
>> Testing vstore_half_rte
>> Testing vstorea_half_rte
>> 268435452) Failure at 0x1.fffff8p-96: *0x0000 vs 0xdead  vector_size =
>> 3 address_space = global
>> Testing vstore_half_rtz
>> Testing vstorea_half_rtz
>> 268435452) Failure at 0x1.fffff8p-96: *0x0000 vs 0xdead  vector_size =
>> 3 address_space = global
>> Testing vstore_half_rtp
>> Testing vstorea_half_rtp
>> 268435452) Failure at 0x1.fffff8p-96: *0x0001 vs 0xdead  vector_size =
>> 3 address_space = global
>> Testing vstore_half_rtn
>> Testing vstorea_half_rtn
>> 268435452) Failure at 0x1.fffff8p-96: *0x0000 vs 0xdead  vector_size =
>> 3 address_space = global
>>
>> Something about the 3-component vstore_half is off, but I'm not sure
>> if this was already broken before, or if the "#if __CLC_FPSIZE > 16"
>> change in vstore_half.inc broke it.
>
> Hi,
>
> this was broken before. vstore/vload_half exist irrespective of
> cl_khr_fp16. The "#if" part is there to prevent vstore_half(half,
> size_t, half*), there's vstore(half, size_t, half*) for that.
> vload_half has similar guard but it only exists for float.
>
> I've actually started looking into this as a simpler problem to fix
> than the conversion generator.
> I found the issue to be in CTS numVecs function(cl_utils.c:327) which
> aligns up for all cases but (aligned && vecsize == 3).
> Thus the last few values are simply not run and return the poison value
> (0xdead). You can check whether "count" is divisible by 4.
>
> A simple patch to numVecs function:
>
> -        return count/4;
> +        return (count + 3)/4;
>
> should fix the issue, I've only checked vstorea_half.
> I'm pretty sure it technically results in OOB access, but thanks to GPU
> buffer size alignment to 64 bytes it shouldn't corrupt memory.
>
> Does the patch look OK to you now? v3 adds "ZERO" undef. Let me know if
> you'd prefer I sent it out as well.

Yeah, I'm ok with this in its current state. I figured there was a
good chance that vstore_half was already failing before your changes,
but I just wanted to check.

Regarding testing, I ran through the CTS quick profile and didn't see
anything obvious going wrong with regards to the half-precision
support.  I also ran the clpeak performance tests successfully without
issue on my RX580.

I previously went through v1 of this change, and I just went through
the diff between v1 and v2.

Looks good to me, and the TODOs left make sense.  We'll get there
eventually... there's no reason to block getting most of the support
in on the few things that aren't finished yet.

Tested/Reviewed-By: Me

--Aaron

>
> thanks,
> Jan
>
>>
>> --Aaron
>>
>> >
>> > >
>> > > Jeroen
>> > >
>> > > >
>> > > > regards,
>> > > > Jan
>> > > >
>> > > > >
>> > > > > In addition to Aaron’s comments:
>> > > > >
>> > > > > +#define HALF_RADIX       2
>> > > > > +#define HALF_MAX         0x1.ffcp15h
>> > > > > +#define HALF_MIN         0x1.0p-14h
>> > > > > +#define HALF_EPSILON     0x1.0p-10h
>> > > > >
>> > > > > There seems to be both a space and a tab between the macro name and
>> > > > > its definition. I also see a few places where indentation is done
>> > > > > with a tab, e.g., generic/lib/relational/isunordered.cl.
>> > > > >
>> > > > > From nan.inc:
>> > > > >
>> > > > > #if __CLC_FPSIZE == 64
>> > > > > #define __CLC_NATN __CLC_XCONCAT(ulong, __CLC_VECSIZE)
>> > > > > -#else
>> > > > > +#elif __CLC_FPSIZE == 32
>> > > > > #define __CLC_NATN __CLC_XCONCAT(uint, __CLC_VECSIZE)
>> > > > > +#else
>> > > > > +#define __CLC_NATN __CLC_XCONCAT(ushort, __CLC_VECSIZE)
>> > > > > #endif
>> > > > >
>> > > > > Wouldn’t it make more sense to also use “#elif __CLC_FPSIZE == 16”, and maybe #error otherwise? Similar in a few other places.
>> > > > >
>> > > > > In generic/lib/math/acos.inc:
>> > > > >
>> > > > > +#else
>> > > > > +#define __CLC_CONST(x) xh
>> > > > >
>> > > > > I think there should be a ## between the x and h.
>> > > > >
>> > > > > In generic/include/math/clc_ldexp.h, the guarded text in indented, while indentation is not used anywhere else.
>> > > > >
>> > > > > I see some cases where the "+#if __CLC_FPSIZE > 16” guard does not have a TODO attached to it, is that on purpose?
>> > > > >
>> > > > > In generic/lib/math/modf.inc casts 0.0f, while generic/lib/math/fract.inc introduces a ZERO macro. Maybe this should be made more uniform (doesn’t have to be as part of this patch).
>> > > > >
>> > > > > Near the end of the patch the int/vector of short difference is explained in a comment "returns an int, but the vector versions return short”. Maybe add a similar comment to generic/include/clc/relational/floatn.inc?
>> > > > >
>> > > > > Jeroen
>> > > > >
>> > > > > >
>> > > > > > This should be enough to make clpeak happy (BZ: 96897)
>> > > > > >
>> > > > > > Jan
>> > > > > > --
>> > > > > > Jan Vesely <jan.vesely at rutgers.edu><0001-Add-initial-support-for-half-precision-builtins.patch.gz>_______________________________________________
>> > > > > > Libclc-dev mailing list
>> > > > > > Libclc-dev at lists.llvm.org
>> > > > > > http://lists.llvm.org/cgi-bin/mailman/listinfo/libclc-dev
>> > > > >
>> > > > > _______________________________________________
>> > > > > Libclc-dev mailing list
>> > > > > Libclc-dev at lists.llvm.org <mailto:Libclc-dev at lists.llvm.org>
>> > > > > http://lists.llvm.org/cgi-bin/mailman/listinfo/libclc-dev <http://lists.llvm.org/cgi-bin/mailman/listinfo/libclc-dev>
>> > > >
>> > > > --
>> > > > Jan Vesely <jan.vesely at rutgers.edu <mailto:jan.vesely at rutgers.edu>><v2-0001-Add-initial-support-for-half-precision-builtins.patch.gz>
>> > >
>> > > _______________________________________________
>> > > Libclc-dev mailing list
>> > > Libclc-dev at lists.llvm.org
>> > > http://lists.llvm.org/cgi-bin/mailman/listinfo/libclc-dev
>> >
>> > _______________________________________________
>> > Libclc-dev mailing list
>> > Libclc-dev at lists.llvm.org
>> > http://lists.llvm.org/cgi-bin/mailman/listinfo/libclc-dev
>> >
>>
>> _______________________________________________
>> Libclc-dev mailing list
>> Libclc-dev at lists.llvm.org
>> http://lists.llvm.org/cgi-bin/mailman/listinfo/libclc-dev