[cfe-dev] [RFC][OpenCL] Pass alignment of arguments in local addr space for device-side enqueued kernel to __enqueue_kernel functions

mats petersson via cfe-dev cfe-dev at lists.llvm.org
Thu Jan 11 04:47:09 PST 2018


I'm still a little confused about the background of this. I understand
that the actual use case may not be something that can be shared, but
perhaps at least some part of the underlying problem can be shared to
help with understanding what the issue is here...

The approach I've taken is to allocate every local argument with the
"largest alignment requirement" (in other words 128 bytes - this may of
course vary depending on the HW available in the GPU).

As I see it, this wouldn't lead to THAT much overhead in the allocations,
as local storage is per work-group, and the number of local arguments is,
hopefully, not a very large number.
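
To put a rough number on that overhead, here is a minimal sketch in plain C
(not OpenCL C, and not code from any actual runtime). The helper names
align_up and local_layout_size are made up for illustration, and it assumes
the local buffers are simply laid out back to back:

```c
#include <assert.h>
#include <stddef.h>

/* Round offset up to the next multiple of align.
 * align must be a non-zero power of two. */
static size_t align_up(size_t offset, size_t align) {
    assert(align != 0 && (align & (align - 1)) == 0);
    return (offset + align - 1) & ~(align - 1);
}

/* Hypothetical helper: bytes of local memory needed to lay out n buffers
 * back to back, where buffer i is sizes[i] bytes long and must start at a
 * multiple of aligns[i]. If aligns is NULL, every buffer is placed at
 * uniform_align instead (the "one large alignment" approach). */
static size_t local_layout_size(const size_t *sizes, const size_t *aligns,
                                size_t n, size_t uniform_align) {
    size_t offset = 0;
    for (size_t i = 0; i < n; ++i) {
        size_t a = aligns ? aligns[i] : uniform_align;
        offset = align_up(offset, a) + sizes[i];
    }
    return offset;
}
```

With three buffers of 4, 16 and 64 bytes (say one element each behind a
local int*, int4* and int16*), this gives 128 bytes with natural alignment
and 320 bytes with a uniform 128-byte alignment. The padding is always less
than 128 bytes per local argument.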

Whilst I'm all for saving memory when possible, I'm not sure that adding a
set of alignment values to the argument list of enqueue_kernel for calls
that have local arguments, with the extra complexity that brings (even if
it's not large), is worth the saving in local memory allocations. I'd really
like to understand why a single large alignment doesn't work in this case.

I'm completely aware that this may be my lack of understanding of something
- hopefully I will learn something new, if that's the case...

--
Mats

On 11 January 2018 at 12:02, Anastasia Stulova via cfe-dev <
cfe-dev at lists.llvm.org> wrote:

> Hi Brian,
>
>
> Considering the current implementation, there is no reason we couldn't
> generate code with arbitrary pointer types instead of void. This is
> implemented as a custom check anyway. I don't know, though, whether there
> might be limitations when using different compilation toolchains, although I
> can imagine this will require custom implementation anywhere. Should we
> clarify this in the spec?
>
>
> Cheers,
>
> Anastasia
> ------------------------------
> *From:* Sumner, Brian <Brian.Sumner at amd.com>
> *Sent:* 10 January 2018 18:26:44
> *To:* Anastasia Stulova; Liu, Yaxun (Sam); cfe-dev (cfe-dev at lists.llvm.org);
> Bader, Alexey (alexey.bader at intel.com)
> *Cc:* nd
>
> *Subject:* RE: [RFC][OpenCL] Pass alignment of arguments in local addr
> space for device-side enqueued kernel to __enqueue_kernel functions
>
>
> From my perspective, this restriction is nonsense.  OpenCL kernel local*
> arguments are not required to point to void.  Why must block local*
> arguments point to void?  They have to be cast to actually be useful; this
> is an unnecessary extra step.  And unless the actual type is available, the
> kernel enqueue mechanism has no choice but to align the storage to 128 bytes
> since any local void * could actually be a local ulong16 *.
>
>
>
> Thanks,
>
> Brian
>
>
>
> *From:* Anastasia Stulova [mailto:Anastasia.Stulova at arm.com]
> *Sent:* Wednesday, January 10, 2018 9:55 AM
> *To:* Liu, Yaxun (Sam); cfe-dev (cfe-dev at lists.llvm.org); Bader, Alexey (
> alexey.bader at intel.com)
> *Cc:* Sumner, Brian; nd
> *Subject:* Re: [RFC][OpenCL] Pass alignment of arguments in local addr
> space for device-side enqueued kernel to __enqueue_kernel functions
>
>
>
> Hi Sam,
>
>
>
> There is a restriction in OpenCL spec I have referenced in my previous
> email - s6.13.17.2, which is implemented by Clang. If you look in the file
> test/SemaOpenCL/cl20-device-side-enqueue.cl around line 117, you will see
> that passing block_B into enqueue_kernel is rejected because it has a
> parameter which isn't "local void*". If you think this is wrong perhaps it
> would make sense to revisit this bit and understand whether the current
> spec should be changed to allow more optimal implementations to exist. But
> as for the current state, I don't think we can implement what you are
> suggesting because we can only have one block argument type for a block in
> enqueue.
>
>
>
> Cheers,
>
> Anastasia
>
>
> ------------------------------
>
> *From:* Liu, Yaxun (Sam) <Yaxun.Liu at amd.com>
> *Sent:* 08 January 2018 22:25
> *To:* Anastasia Stulova; cfe-dev (cfe-dev at lists.llvm.org); Bader, Alexey (
> alexey.bader at intel.com)
> *Cc:* Sumner, Brian; nd
> *Subject:* RE: [RFC][OpenCL] Pass alignment of arguments in local addr
> space for device-side enqueued kernel to __enqueue_kernel functions
>
>
>
> My comments are below.
>
>
>
> *From:* Anastasia Stulova [mailto:Anastasia.Stulova at arm.com
> <Anastasia.Stulova at arm.com>]
> *Sent:* Tuesday, December 19, 2017 10:21 AM
> *To:* Liu, Yaxun (Sam) <Yaxun.Liu at amd.com>; cfe-dev (
> cfe-dev at lists.llvm.org) <cfe-dev at lists.llvm.org>; Bader, Alexey (
> alexey.bader at intel.com) <alexey.bader at intel.com>
> *Cc:* Sumner, Brian <Brian.Sumner at amd.com>; nd <nd at arm.com>
> *Subject:* Re: [RFC][OpenCL] Pass alignment of arguments in local addr
> space for device-side enqueued kernel to __enqueue_kernel functions
>
>
>
> > For example, if a block kernel has argument local int4*. Its alignment
> should be 16 bytes.
>
>
>
> Perhaps I am missing something but I still don't see anything in the spec
> that requires pointers themselves to take alignment from the pointee type.
> In your example int4* should be aligned to the pointer size (either 4 or 8
> bytes) while int4 should be 16 byte aligned. Clang will set the alignment
> of load and store operations correctly according to their data types
> specified in the source code (which is mainly inherited from C
> implementation apart from some special data types like vectors). The
> arguments passed to kernels are allocated elsewhere and OpenCL compiler has
> no control over this.
>
>   Sam: The spec (v2.0s6.1.5) requires “the pointee is always appropriately
> aligned as required by the data type”, which means the pointee of the
> kernel argument of int4* type should be aligned at 16 bytes.
>
>
> Regarding enqueued kernels as far as I understand you suggest to add block
> argument alignment info into builtin? Even though it shouldn't be strictly
> necessary I believe some implementation can indeed be done more efficiently
> using this. So I don't see any problem adding this. However, spec
> (s6.13.17.2) mandates that the enqueued block function only has void* types
> as parameters: "Each argument must be declared to be a void pointer to
> local memory."  So could you elaborate please where exactly do you plan to
> get the optimal alignment from?
>
>   Sam: The block function is passed to the builtin. The argument of the
> block function has the proper data type instead of void* type. Clang can
> deduce the alignment of the pointee of the kernel argument from the block
> function type.
>
>
> Thanks,
> Anastasia
> ------------------------------
>
> *From:* Liu, Yaxun (Sam) <Yaxun.Liu at amd.com>
> *Sent:* 15 December 2017 19:08
> *To:* Anastasia Stulova; cfe-dev (cfe-dev at lists.llvm.org); Bader, Alexey (
> alexey.bader at intel.com)
> *Cc:* Sumner, Brian; nd
> *Subject:* RE: [RFC][OpenCL] Pass alignment of arguments in local addr
> space for device-side enqueued kernel to __enqueue_kernel functions
>
>
>
> Spec reference:
>
>
>
> OpenCL v2.0 s6.1.5
>
> The OpenCL compiler is responsible for aligning data items to the
> appropriate alignment as required by the data type. For arguments to a
> __kernel function declared to be a pointer to a data type, the OpenCL
> compiler can assume that the pointee is always appropriately aligned as
> required by the data type. The behavior of an unaligned load or store is
> undefined, except for the vloadn, vload_halfn, vstoren, and vstore_halfn
> functions defined in section 6.13.7.
>
>
>
> s6.2.5
>
> Casting a pointer to a new type represents an unchecked assertion that the
> address is correctly aligned.
>
>
>
> The C Standard, 6.3.2.3, paragraph 7 [ISO/IEC 9899:2011], states
>
>
>
> A pointer to an object or incomplete type may be converted to a pointer to
> a different object or incomplete type. If the resulting pointer is not
> correctly aligned for the referenced type, the behavior is undefined.
>
>
>
> For example, if a block kernel has argument local int4*. Its alignment
> should be 16 bytes. Passing a pointer aligned to 1 byte may result in
> undefined behavior. Most hardware can still load from the unaligned memory
> but will incur a performance hit. If the runtime wants to avoid the
> performance hit, it has to allocate the buffer at the maximum possible
> alignment, e.g. 32 bytes, which will result in wasted memory.
>
>
>
> Sam
>
>
>
> *From:* Anastasia Stulova [mailto:Anastasia.Stulova at arm.com
> <Anastasia.Stulova at arm.com>]
> *Sent:* Friday, December 15, 2017 10:40 AM
> *To:* Liu, Yaxun (Sam) <Yaxun.Liu at amd.com>; cfe-dev (
> cfe-dev at lists.llvm.org) <cfe-dev at lists.llvm.org>; Bader, Alexey (
> alexey.bader at intel.com) <alexey.bader at intel.com>
> *Cc:* Sumner, Brian <Brian.Sumner at amd.com>; nd <nd at arm.com>
> *Subject:* Re: [RFC][OpenCL] Pass alignment of arguments in local addr
> space for device-side enqueued kernel to __enqueue_kernel functions
>
>
>
>
>
> > OpenCL spec requires that a pointer should be aligned to at least the
> pointee type.
>
>
>
> So a pointer to int16 would be 64 byte aligned? Seems strange though. Can
> you give me the spec reference?
>
> > Otherwise, __enqueue_kernel has to either allocate an unaligned local
> buffer, which degrades performance, or allocate a local buffer with extra
> alignment, which wastes memory space.
>
> Can you explain in more detail here, please?
>
> Cheers,
> Anastasia
> ------------------------------
>
> *From:* Liu, Yaxun (Sam) <Yaxun.Liu at amd.com>
> *Sent:* 01 December 2017 19:45
> *To:* Anastasia Stulova; cfe-dev (cfe-dev at lists.llvm.org); Bader, Alexey (
> alexey.bader at intel.com)
> *Cc:* Sumner, Brian
> *Subject:* [RFC][OpenCL] Pass alignment of arguments in local addr space
> for device-side enqueued kernel to __enqueue_kernel functions
>
>
>
> Hi,
>
>
>
> OpenCL spec requires that a pointer should be aligned to at least the
> pointee type. Therefore, if a device-side enqueued kernel has a local int*
> argument, it should be aligned to 4 bytes.
>
>
>
> Since these buffers in local addr space are allocated by __enqueue_kernel,
> it needs to know the alignment of these buffers, not just their sizes.
>
>
>
> Although such information is not passed to the original OpenCL builtin
> function enqueue_kernel, it can be obtained by checking the prototype of
> the block invoke function at compile time.
>
>
>
> I would like to create a patch to pass this information to
> __enqueue_kernel. Otherwise, __enqueue_kernel has to either allocate an
> unaligned local buffer, which degrades performance, or allocate a local
> buffer with extra alignment, which wastes memory space.
>
>
>
> Any comments?
>
>
>
> Thanks.
>
>
>
> Sam
>