[cfe-dev] [RFC][OpenCL] Pass alignment of arguments in local addr space for device-side enqueued kernel to __enqueue_kernel functions

mats petersson via cfe-dev cfe-dev at lists.llvm.org
Fri Jan 12 05:31:53 PST 2018


On 11 January 2018 at 16:39, Liu, Yaxun (Sam) <Yaxun.Liu at amd.com> wrote:

> The workgroup size is usually 64 or 128. The number of workgroups can be
> quite large if the global size is large. If for each local array we waste
> 124 bytes, the total waste could be quite large, considering local memory
> is a precious resource for a GPU.
>

It is, if you have a local argument that is just 4 bytes. But is that
really a typical practical use case? Doing a local memory allocation in the
first place to store 4 bytes seems a bit excessive.

Also, a simplification would be to do something like this:

alignment = min(round_to_nearest_power_of_2(size), max_alignment),

so you either align to the size of the argument [because there is no CL
type where the alignment is greater than the size of the type itself], or
the maximum alignment. This doesn't require any further arguments to be
passed, but gives a reasonable alignment. Sure, it's going to align an array
of 6 int values to 32 bytes, but it's not the same loss as rounding
everything to 128 bytes, and can be done without changing anything.
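
As a rough sketch of that formula in plain C (the helper names
local_arg_alignment and padding_for are made up for illustration, and I'm
assuming that "round" means rounding up to the next power of two and that
max_alignment is itself a power of two):

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: alignment for a local buffer of `size` bytes,
 * i.e. min(size rounded up to a power of two, max_alignment).
 * Assumes max_alignment is a non-zero power of two. */
static size_t local_arg_alignment(size_t size, size_t max_alignment)
{
    assert(max_alignment != 0 &&
           (max_alignment & (max_alignment - 1)) == 0);

    size_t a = 1;
    /* Double until we cover the size or hit the cap; since the cap is a
     * power of two, `a` never exceeds it and cannot overflow. */
    while (a < size && a < max_alignment)
        a <<= 1;
    return a;
}

/* Hypothetical helper: bytes of padding needed to bring `offset` up to
 * the next multiple of `alignment`. */
static size_t padding_for(size_t offset, size_t alignment)
{
    return (alignment - offset % alignment) % alignment;
}
```

With max_alignment = 128, a 4-byte buffer gets 4-byte alignment and a
24-byte buffer (6 ints) gets 32. padding_for(4, 128) gives the 124 bytes
that are lost after a 4-byte buffer if the next buffer has to start on a
128-byte boundary.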

--
Mats

>
>
> On the other hand, passing the alignment info and using it is pretty
> straightforward.
>
>
>
> Sam
>
>
>
> *From:* mats.o.petersson at googlemail.com [mailto:mats.o.petersson@
> googlemail.com] *On Behalf Of *mats petersson
> *Sent:* Thursday, January 11, 2018 7:47 AM
> *To:* Anastasia Stulova <Anastasia.Stulova at arm.com>
> *Cc:* Sumner, Brian <Brian.Sumner at amd.com>; Liu, Yaxun (Sam) <
> Yaxun.Liu at amd.com>; cfe-dev (cfe-dev at lists.llvm.org) <
> cfe-dev at lists.llvm.org>; Bader, Alexey (alexey.bader at intel.com) <
> alexey.bader at intel.com>; nd <nd at arm.com>
> *Subject:* Re: [cfe-dev] [RFC][OpenCL] Pass alignment of arguments in
> local addr space for device-side enqueued kernel to __enqueue_kernel
> functions
>
>
>
> I'm still a little bit confused about the background of this. And I
> understand that the actual use case here may not be something that can be
> shared, but perhaps at least some part of the underlying problem can be
> shared to help with understanding what the issue is here...
>
> The approach I've taken is to allocate every local argument with the
> "largest alignment requirement" (in other words 128 bytes - this may of
> course vary depending on the HW available in the GPU).
>
> As I see it, this wouldn't lead to THAT much overhead in the allocations,
> as local storage is per work-group, and the number of local arguments is,
> hopefully, not a very large number.
>
> Whilst I'm all for saving memory when possible, I'm not sure adding a set
> of alignment values to the argument list of enqueue_kernel, for calls that
> have local arguments, and the extra complexity, even if it's not large, is
> worth the saving of local memory allocations. I'd really like to understand
> why a single large alignment doesn't work in this case.
>
> I'm completely aware that this may be my lack of understanding of
> something - hopefully I will learn something new, if that's the case...
>
> --
>
> Mats
>
>
>
> On 11 January 2018 at 12:02, Anastasia Stulova via cfe-dev <
> cfe-dev at lists.llvm.org> wrote:
>
> Hi Brian,
>
>
>
> Considering the current implementation there is no reason we couldn't
> generate code with arbitrary pointer types instead of void. This is anyways
> implemented as a custom check. I don't know though if there might be
> limitations if using different compilation toolchains or so. Although I can
> imagine this will require a custom implementation anywhere. Should we
> clarify this in the spec?
>
>
>
> Cheers,
>
> Anastasia
> ------------------------------
>
> *From:* Sumner, Brian <Brian.Sumner at amd.com>
> *Sent:* 10 January 2018 18:26:44
> *To:* Anastasia Stulova; Liu, Yaxun (Sam); cfe-dev (cfe-dev at lists.llvm.org);
> Bader, Alexey (alexey.bader at intel.com)
> *Cc:* nd
>
>
> *Subject:* RE: [RFC][OpenCL] Pass alignment of arguments in local addr
> space for device-side enqueued kernel to __enqueue_kernel functions
>
>
>
> From my perspective, this restriction is nonsense.  OpenCL kernel local*
> arguments are not required to point to void.  Why must block local*
> arguments point to void?  They have to be cast to actually be useful; this
> is an unnecessary extra step.  And unless the actual type is available, the
> kernel enqueue mechanism has no choice but to align the storage to 128
> bytes, since any local void * could actually be a local ulong16 *.
>
>
>
> Thanks,
>
> Brian
>
>
>
> *From:* Anastasia Stulova [mailto:Anastasia.Stulova at arm.com]
> *Sent:* Wednesday, January 10, 2018 9:55 AM
> *To:* Liu, Yaxun (Sam); cfe-dev (cfe-dev at lists.llvm.org); Bader, Alexey (
> alexey.bader at intel.com)
> *Cc:* Sumner, Brian; nd
> *Subject:* Re: [RFC][OpenCL] Pass alignment of arguments in local addr
> space for device-side enqueued kernel to __enqueue_kernel functions
>
>
>
> Hi Sam,
>
>
>
> There is a restriction in the OpenCL spec that I referenced in my previous
> email - s6.13.17.2, which is implemented by Clang. If you look in the file
> test/SemaOpenCL/cl20-device-side-enqueue.cl around line 117, you will see
> that block_B is rejected when passed to enqueue_kernel because it has a
> parameter which isn't "local void*". If you think this is wrong perhaps it
> would make sense to revisit this bit and understand whether the current
> spec should be changed to allow more optimal implementations to exist. But
> as for the current state, I don't think we can implement what you are
> suggesting because we can only have one block argument type for a block in
> enqueue.
>
>
>
> Cheers,
>
> Anastasia
>
>
> ------------------------------
>
> *From:* Liu, Yaxun (Sam) <Yaxun.Liu at amd.com>
> *Sent:* 08 January 2018 22:25
> *To:* Anastasia Stulova; cfe-dev (cfe-dev at lists.llvm.org); Bader, Alexey (
> alexey.bader at intel.com)
> *Cc:* Sumner, Brian; nd
> *Subject:* RE: [RFC][OpenCL] Pass alignment of arguments in local addr
> space for device-side enqueued kernel to __enqueue_kernel functions
>
>
>
> My comments are below.
>
>
>
> *From:* Anastasia Stulova [mailto:Anastasia.Stulova at arm.com
> <Anastasia.Stulova at arm.com>]
> *Sent:* Tuesday, December 19, 2017 10:21 AM
> *To:* Liu, Yaxun (Sam) <Yaxun.Liu at amd.com>; cfe-dev (
> cfe-dev at lists.llvm.org) <cfe-dev at lists.llvm.org>; Bader, Alexey (
> alexey.bader at intel.com) <alexey.bader at intel.com>
> *Cc:* Sumner, Brian <Brian.Sumner at amd.com>; nd <nd at arm.com>
> *Subject:* Re: [RFC][OpenCL] Pass alignment of arguments in local addr
> space for device-side enqueued kernel to __enqueue_kernel functions
>
>
>
> > For example, if a block kernel has an argument of type local int4*, its
> alignment should be 16 bytes.
>
>
>
> Perhaps I am missing something but I still don't see anything in the spec
> that requires pointers themselves to take alignment from the pointee type.
> In your example int4* should be aligned to the pointer size (either 4 or 8
> bytes) while int4 should be 16 byte aligned. Clang will set the alignment
> of load and store operations correctly according to their data types
> specified in the source code (which is mainly inherited from C
> implementation apart from some special data types like vectors). The
> arguments passed to kernels are allocated elsewhere and OpenCL compiler has
> no control over this.
>
>   Sam: The spec (v2.0 s6.1.5) requires “the pointee is always appropriately
> aligned as required by the data type”, which means the pointee of the
> kernel argument of int4* type should be aligned at 16 bytes.
>
>
> Regarding enqueued kernels, as far as I understand you suggest adding block
> argument alignment info to the builtin? Even though it shouldn't be strictly
> necessary, I believe some implementations can indeed be done more
> efficiently using this. So I don't see any problem adding this. However,
> the spec (s6.13.17.2) mandates that the enqueued block function only has
> void* types as parameters: "Each argument must be declared to be a void
> pointer to local memory."  So could you please elaborate on where exactly
> you plan to get the optimal alignment from?
>
>   Sam: The block function is passed to the builtin. The argument of the
> block function has the proper data type instead of void* type. Clang can
> deduce the alignment of the pointee of the kernel argument from the block
> function type.
>
>
> Thanks,
> Anastasia
> ------------------------------
>
> *From:* Liu, Yaxun (Sam) <Yaxun.Liu at amd.com>
> *Sent:* 15 December 2017 19:08
> *To:* Anastasia Stulova; cfe-dev (cfe-dev at lists.llvm.org); Bader, Alexey (
> alexey.bader at intel.com)
> *Cc:* Sumner, Brian; nd
> *Subject:* RE: [RFC][OpenCL] Pass alignment of arguments in local addr
> space for device-side enqueued kernel to __enqueue_kernel functions
>
>
>
> Spec reference:
>
>
>
> OpenCL v2.0 s6.1.5
>
> The OpenCL compiler is responsible for aligning data items to the
> appropriate alignment as required by the data type. For arguments to a
> __kernel function declared to be a pointer to a data type, the OpenCL
> compiler can assume that the pointee is always appropriately aligned as
> required by the data type. The behavior of an unaligned load or store is
> undefined, except for the vloadn, vload_halfn, vstoren, and vstore_halfn
> functions defined in section 6.13.7.
>
>
>
> s6.2.5
>
> Casting a pointer to a new type represents an unchecked assertion that the
> address is correctly aligned.
>
>
>
> The C Standard, 6.3.2.3, paragraph 7 [ISO/IEC 9899:2011], states
>
>
>
> A pointer to an object or incomplete type may be converted to a pointer to
> a different object or incomplete type. If the resulting pointer is not
> correctly aligned for the referenced type, the behavior is undefined.
>
>
>
> For example, if a block kernel has an argument of type local int4*, its
> alignment should be 16 bytes. Passing a pointer aligned to 1 byte may result
> in undefined behavior. Most hardware can still load from the unaligned
> memory but will incur a performance hit. If the runtime wants to avoid the
> performance hit, it has to allocate the buffer at the maximum possible
> alignment, e.g. 32 bytes, which will result in wasted memory.
>
>
>
> Sam
>
>
>
> *From:* Anastasia Stulova [mailto:Anastasia.Stulova at arm.com
> <Anastasia.Stulova at arm.com>]
> *Sent:* Friday, December 15, 2017 10:40 AM
> *To:* Liu, Yaxun (Sam) <Yaxun.Liu at amd.com>; cfe-dev (
> cfe-dev at lists.llvm.org) <cfe-dev at lists.llvm.org>; Bader, Alexey (
> alexey.bader at intel.com) <alexey.bader at intel.com>
> *Cc:* Sumner, Brian <Brian.Sumner at amd.com>; nd <nd at arm.com>
> *Subject:* Re: [RFC][OpenCL] Pass alignment of arguments in local addr
> space for device-side enqueued kernel to __enqueue_kernel functions
>
>
>
>
>
> > OpenCL spec requires that a pointer should be aligned to at least the
> pointee type.
>
>
>
> So a pointer to int16 would be 64 byte aligned? Seems strange though. Can
> you give me the spec reference?
>
> > Otherwise, __enqueue_kernel has to either allocate an unaligned local
> buffer, which degrades performance, or allocate a local buffer with extra
> alignment, thereby wasting memory space.
>
> Can you explain in more detail here, please?
>
> Cheers,
> Anastasia
> ------------------------------
>
> *From:* Liu, Yaxun (Sam) <Yaxun.Liu at amd.com>
> *Sent:* 01 December 2017 19:45
> *To:* Anastasia Stulova; cfe-dev (cfe-dev at lists.llvm.org); Bader, Alexey (
> alexey.bader at intel.com)
> *Cc:* Sumner, Brian
> *Subject:* [RFC][OpenCL] Pass alignment of arguments in local addr space
> for device-side enqueued kernel to __enqueue_kernel functions
>
>
>
> Hi,
>
>
>
> OpenCL spec requires that a pointer should be aligned to at least the
> pointee type. Therefore, if a device-side enqueued kernel has a local int*
> argument, it should be aligned to 4 bytes.
>
>
>
> Since these buffers in local addr space are allocated by __enqueue_kernel,
> it needs to know the alignment of these buffers, not just their sizes.
>
>
>
> Although such information is not passed to the original OpenCL builtin
> function enqueue_kernel, it can be obtained by checking the prototype of
> the block invoke function at compile time.
>
>
>
> I would like to create a patch to pass this information to
>  __enqueue_kernel. Otherwise, __enqueue_kernel has to either allocate an
> unaligned local buffer, which degrades performance, or allocate a local
> buffer with extra alignment, thereby wasting memory space.
>
>
>
> Any comments?
>
>
>
> Thanks.
>
>
>
> Sam
>
>
> _______________________________________________
> cfe-dev mailing list
> cfe-dev at lists.llvm.org
> http://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-dev
>
>
>