[cfe-dev] [RFC][OpenCL] Pass alignment of arguments in local addr space for device-side enqueued kernel to __enqueue_kernel functions

mats petersson via cfe-dev cfe-dev at lists.llvm.org
Mon Jan 15 10:25:35 PST 2018


On 12 January 2018 at 19:32, Liu, Yaxun (Sam) <Yaxun.Liu at amd.com> wrote:

> My comments are below.
>
>
>
> Sam
>
>
>
> *From:* mats.o.petersson at googlemail.com
> [mailto:mats.o.petersson at googlemail.com] *On Behalf Of *mats petersson
> *Sent:* Friday, January 12, 2018 8:32 AM
> *To:* Liu, Yaxun (Sam) <Yaxun.Liu at amd.com>
> *Cc:* Anastasia Stulova <Anastasia.Stulova at arm.com>; Sumner, Brian <
> Brian.Sumner at amd.com>; cfe-dev (cfe-dev at lists.llvm.org) <
> cfe-dev at lists.llvm.org>; Bader, Alexey (alexey.bader at intel.com) <
> alexey.bader at intel.com>; nd <nd at arm.com>
> *Subject:* Re: [cfe-dev] [RFC][OpenCL] Pass alignment of arguments in
> local addr space for device-side enqueued kernel to __enqueue_kernel
> functions
>
>
>
>
>
>
>
> On 11 January 2018 at 16:39, Liu, Yaxun (Sam) <Yaxun.Liu at amd.com> wrote:
>
> The workgroup size is usually 64 or 128. The number of workgroups can be
> quite large if the global size is large. If for each local array we waste
> 124 bytes, the total waste could be quite large, considering local memory
> is a precious resource on a GPU.
>
>
>
> It is, if you have a local argument that is just 4 bytes. But is that
> really typical practical use-case? Doing a local memory allocation in the
> first place to store 4 bytes seems a bit excessive.
>
> [Sam] The waste of memory could happen to an integer array of any size,
> e.g. int a[10], which only needs to be aligned to 4 bytes. Aligning it to
> 128 bytes wastes 124 bytes.
>
Clearly not: in this case, the local argument is 40 bytes, and thus the
wastage is at most 88 bytes (128-40). And I'm not arguing that this is not
wasted; I'm trying to understand what the use-case is where the user uses
local memory in such a small amount per workgroup.

> Also, a simplification would be to do something like this:
>
> alignment = min(round_to_nearest_power_of_2(size), max_alignment),
>
> [Sam] We cannot predict how users will use local memory. In certain cases
> the above approach still wastes considerable local memory. I think it is
> better to let users fully utilize their local memory, considering the
> implementation effort is moderate.
>

With the above suggestion, the implementation cost is nearly zero, and you
CAN assume that the user will not access outside the range of the actual
allocated space [that would be UB]. For an int [10], the "loss" would be
24 bytes, because the rounding up would be to 64 bytes, and the worst
possible case for small buffers is for int [17], which would waste 60
bytes. For large buffers, the worst case can of course still be 124 bytes.
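
A minimal C sketch of that one-line rule, to make the numbers above easy to
check. It assumes the rounding is "up to the next power of two" (which is
what the 64-byte figure for int [10] implies) and a maximum alignment of 128
bytes. The function names are made up for illustration and are not part of
any runtime.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_ALIGNMENT ((size_t)128) /* assumed hardware maximum */

/* Smallest power of two that is >= n. n must be non-zero and small
 * enough (local-memory sized) that the shift cannot overflow. */
static size_t round_up_pow2(size_t n) {
    assert(n > 0);
    size_t p = 1;
    while (p < n)
        p <<= 1;
    return p;
}

/* alignment = min(round_up_pow2(size), MAX_ALIGNMENT) */
static size_t simple_alignment(size_t size) {
    size_t a = round_up_pow2(size);
    return a < MAX_ALIGNMENT ? a : MAX_ALIGNMENT;
}

/* Bytes left unused between the end of the buffer and the next
 * multiple of its alignment. */
static size_t tail_padding(size_t size) {
    size_t a = simple_alignment(size);
    return (a - size % a) % a;
}
```

With these definitions, tail_padding(40) is 24 and tail_padding(68) is 60.
A 132-byte buffer hits the 128-byte clamp and leaves 124 bytes unused.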

You could potentially do something like (I have not validated this - and it
still needs clamping to 128 or something of course)

     rounded_size = round_to_nearest_smaller_power_2(size);
     if (size != rounded_size)
     {
         alignment = size % rounded_size;
     }
     else
     {
         alignment = rounded_size;
     }

This will give you an alignment of 8 for int [10], and 4 for int [17].
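
For reference, a C sketch of the snippet above with the clamp added. One
thing it makes visible: the remainder is only a power of two for some sizes
(44 bytes gives 12). The sketch therefore also shows a variant that takes the
largest power of two dividing the size. That variant is an assumption of this
sketch, not something proposed in the thread. Both agree on int [10] and
int [17]. All names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_ALIGNMENT ((size_t)128) /* assumed clamp value */

/* Largest power of two that is <= n (n must be non-zero). */
static size_t round_down_pow2(size_t n) {
    assert(n > 0);
    size_t p = 1;
    while (p <= n / 2)
        p <<= 1;
    return p;
}

static size_t clamp_alignment(size_t a) {
    return a < MAX_ALIGNMENT ? a : MAX_ALIGNMENT;
}

/* Direct transcription of the snippet: the result is not always a
 * power of two, e.g. 44 -> 12. */
static size_t remainder_alignment(size_t size) {
    size_t rounded = round_down_pow2(size);
    size_t a = (size != rounded) ? size % rounded : rounded;
    return clamp_alignment(a);
}

/* Variant: largest power of two that divides size, i.e. its lowest
 * set bit. Always a power of two. */
static size_t divisor_alignment(size_t size) {
    assert(size > 0);
    return clamp_alignment(size & (~size + 1));
}
```

Both functions return 8 for 40 bytes and 4 for 68 bytes. For 44 bytes the
transcription returns 12, while the variant returns 4.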

This does of course assume that nobody tries to load 16 elements of the int
[17] with a vector instruction that requires 64-byte alignment, and then
load the remaining element on its own. That wouldn't work well here; it
would only work if the user call supplied the alignment, which I don't
think is the proposed solution.

Of course, if you have a bunch of different local arguments, of varying
sizes, this will still potentially lead to wasted space, but less so. For
example int [1], int [10], int[1], int [32], int [12] would lead to several
gaps of varying sizes. If this is what is expected - and I don't really
know what use cases there are out there that use local memory combined with
device-side enqueue - then I would say, it may be worth doing this.
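
To put rough numbers on that example, here is a small C sketch that packs
the buffers back to back under different alignment policies and reports the
total local memory consumed. It assumes the base address is aligned to 128
bytes, that buffers are placed in argument order, and that the maximum
alignment is 128. The struct and function names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_ALIGNMENT ((size_t)128) /* assumed hardware maximum */

typedef struct {
    size_t size;       /* bytes requested for the local buffer */
    size_t elem_align; /* alignment of the pointee type, if known */
} LocalArg;

typedef size_t (*AlignPolicy)(const LocalArg *arg);

/* Policy 1: every buffer gets the maximum alignment. */
static size_t policy_max(const LocalArg *arg) {
    (void)arg;
    return MAX_ALIGNMENT;
}

/* Policy 2: min(size rounded up to a power of two, maximum alignment). */
static size_t policy_pow2_of_size(const LocalArg *arg) {
    size_t p = 1;
    while (p < arg->size && p < MAX_ALIGNMENT)
        p <<= 1;
    return p;
}

/* Policy 3: the pointee alignment is passed in, as the RFC proposes. */
static size_t policy_exact(const LocalArg *arg) {
    return arg->elem_align;
}

/* Place the buffers back to back and return the total bytes consumed. */
static size_t layout_total(const LocalArg *args, size_t count,
                           AlignPolicy policy) {
    size_t offset = 0;
    for (size_t i = 0; i < count; ++i) {
        size_t a = policy(&args[i]);
        assert(a > 0 && (a & (a - 1)) == 0);  /* must be a power of two */
        offset = (offset + a - 1) & ~(a - 1); /* round up to alignment */
        offset += args[i].size;
    }
    return offset;
}
```

For the five int buffers above (4, 40, 4, 128 and 48 bytes, 224 bytes of
payload), this gives 560 bytes with everything aligned to 128, 304 bytes
with the size-based rule, and 224 bytes when the real 4-byte alignment is
known.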

Have you investigated some work-loads with regard to how much space you
gain from "the tightest possible packing", compared to my above solution,
the one-line solution, and "round everything to 128"?
Without revealing what the work-loads are, perhaps you could show something
like:
Kernel A: 12, 36, 18, 128 bytes
Kernel B: 116, 236, 240, 256 bytes
Kernel C: ...
[I just made those numbers up, and I don't really expect the numbers to
make any sense compared to real applications and numbers]

Not quite a single line, but still trivial compared to passing and handling
an array of extra arguments, which requires modification of several
different files, adding new test-cases, etc. [Although you may want to add
some test-cases for this implementation, of course].

--
Mats

>
> so you either align to the size of the argument [because there is no CL
> type where the alignment is greater than the size of the type itself], or
> the maximum alignment. This doesn't require any further arguments to be
> passed, but gives a reasonable alignment. Sure, it's going to align an
> array of 6 int values to 32 bytes, but it's not the same loss as rounding
> everything to 128 bytes, and can be done without changing anything.
>
>
>
> --
>
> Mats
>
>
>
> On the other hand, passing the alignment info and using it is pretty
> straightforward.
>
>
>
> Sam
>
>
>
> *From:* mats.o.petersson at googlemail.com
> [mailto:mats.o.petersson at googlemail.com] *On Behalf Of *mats petersson
> *Sent:* Thursday, January 11, 2018 7:47 AM
> *To:* Anastasia Stulova <Anastasia.Stulova at arm.com>
> *Cc:* Sumner, Brian <Brian.Sumner at amd.com>; Liu, Yaxun (Sam) <
> Yaxun.Liu at amd.com>; cfe-dev (cfe-dev at lists.llvm.org) <
> cfe-dev at lists.llvm.org>; Bader, Alexey (alexey.bader at intel.com) <
> alexey.bader at intel.com>; nd <nd at arm.com>
> *Subject:* Re: [cfe-dev] [RFC][OpenCL] Pass alignment of arguments in
> local addr space for device-side enqueued kernel to __enqueue_kernel
> functions
>
>
>
> I'm still a little bit confused about the background of this. I
> understand that the actual use case here may not be something that can be
> shared, but perhaps at least some part of the underlying problem can be
> shared to help with understanding what the issue is here...
>
> The approach I've taken is to allocate every local argument with the
> "largest alignment requirement" (in other words 128 bytes - this may of
> course vary depending on the HW available in the GPU).
>
> As I see it, this wouldn't lead to THAT much overhead in the allocations,
> as local storage is per work-group, and the number of local arguments is,
> hopefully, not a very large number.
>
> Whilst I'm all for saving memory when possible, I'm not sure adding a set
> of alignment values to the argument list of enqueue_kernel, for calls that
> have local arguments, and the extra complexity, even if it's not large, is
> worth the saving of local memory allocations. I'd really like to understand
> why a single large alignment doesn't work in this case.
>
> I'm completely aware that this may be my lack of understanding of
> something - hopefully I will learn something new, if that's the case...
>
> --
>
> Mats
>
>
>
> On 11 January 2018 at 12:02, Anastasia Stulova via cfe-dev <
> cfe-dev at lists.llvm.org> wrote:
>
> Hi Brian,
>
>
>
> Considering the current implementation there is no reason we couldn't
> generate code with arbitrary pointer types instead of void. This is anyways
> implemented as a custom check. I don't know though if there might be
> limitations if using different compilation toolchains or so. Although I can
> imagine this will require custom implementation anywhere. Should we clarify
> this in spec?
>
>
>
> Cheers,
>
> Anastasia
> ------------------------------
>
> *From:* Sumner, Brian <Brian.Sumner at amd.com>
> *Sent:* 10 January 2018 18:26:44
> *To:* Anastasia Stulova; Liu, Yaxun (Sam); cfe-dev (cfe-dev at lists.llvm.org);
> Bader, Alexey (alexey.bader at intel.com)
> *Cc:* nd
>
>
> *Subject:* RE: [RFC][OpenCL] Pass alignment of arguments in local addr
> space for device-side enqueued kernel to __enqueue_kernel functions
>
>
>
> From my perspective, this restriction is nonsense.  OpenCL kernel local*
> arguments are not required to point to void.  Why must block local*
> arguments point to void?  They have to be cast to actually be useful; this
> is an unnecessary extra step.  And unless the actual type is available, the
> kernel enqueue mechanism has no choice but to align the storage to 128 bytes
> since any local void * could actually be a local ulong16 *.
>
>
>
> Thanks,
>
> Brian
>
>
>
> *From:* Anastasia Stulova [mailto:Anastasia.Stulova at arm.com]
> *Sent:* Wednesday, January 10, 2018 9:55 AM
> *To:* Liu, Yaxun (Sam); cfe-dev (cfe-dev at lists.llvm.org); Bader, Alexey (
> alexey.bader at intel.com)
> *Cc:* Sumner, Brian; nd
> *Subject:* Re: [RFC][OpenCL] Pass alignment of arguments in local addr
> space for device-side enqueued kernel to __enqueue_kernel functions
>
>
>
> Hi Sam,
>
>
>
> There is a restriction in OpenCL spec I have referenced in my previous
> email - s6.13.17.2, which is implemented by Clang. If you look in the file
> test/SemaOpenCL/cl20-device-side-enqueue.cl around line 117, you will see
> that block_B is rejected to be passed into enqueue_kernel because it has a
> parameter which isn't "local void*". If you think this is wrong perhaps it
> would make sense to revisit this bit and understand whether the current
> spec should be changed to allow more optimal implementations to exist. But
> as for the current state, I don't think we can implement what you are
> suggesting because we can only have one block argument type for a block in
> enqueue.
>
>
>
> Cheers,
>
> Anastasia
>
>
> ------------------------------
>
> *From:* Liu, Yaxun (Sam) <Yaxun.Liu at amd.com>
> *Sent:* 08 January 2018 22:25
> *To:* Anastasia Stulova; cfe-dev (cfe-dev at lists.llvm.org); Bader, Alexey (
> alexey.bader at intel.com)
> *Cc:* Sumner, Brian; nd
> *Subject:* RE: [RFC][OpenCL] Pass alignment of arguments in local addr
> space for device-side enqueued kernel to __enqueue_kernel functions
>
>
>
> My comments are below.
>
>
>
> *From:* Anastasia Stulova [mailto:Anastasia.Stulova at arm.com
> <Anastasia.Stulova at arm.com>]
> *Sent:* Tuesday, December 19, 2017 10:21 AM
> *To:* Liu, Yaxun (Sam) <Yaxun.Liu at amd.com>; cfe-dev (
> cfe-dev at lists.llvm.org) <cfe-dev at lists.llvm.org>; Bader, Alexey (
> alexey.bader at intel.com) <alexey.bader at intel.com>
> *Cc:* Sumner, Brian <Brian.Sumner at amd.com>; nd <nd at arm.com>
> *Subject:* Re: [RFC][OpenCL] Pass alignment of arguments in local addr
> space for device-side enqueued kernel to __enqueue_kernel functions
>
>
>
> > For example, if a block kernel has argument local int4*. Its alignment
> should be 16 bytes.
>
>
>
> Perhaps I am missing something but I still don't see anything in the spec
> that requires pointers themselves to take alignment from the pointee type.
> In your example int4* should be aligned to the pointer size (either 4 or 8
> bytes) while int4 should be 16 byte aligned. Clang will set the alignment
> of load and store operations correctly according to their data types
> specified in the source code (which is mainly inherited from C
> implementation apart from some special data types like vectors). The
> arguments passed to kernels are allocated elsewhere and OpenCL compiler has
> no control over this.
>
>   Sam: The spec (v2.0 s6.1.5) requires “the pointee is always appropriately
> aligned as required by the data type”, which means the pointee of the
> kernel argument of int4* type should be aligned at 16 bytes.
>
>
> Regarding enqueued kernels as far as I understand you suggest to add block
> argument alignment info into builtin? Even though it shouldn't be strictly
> necessary I believe some implementation can indeed be done more efficiently
> using this. So I don't see any problem adding this. However, spec
> (s6.13.17.2) mandates that the enqueued block function only has void* types
> as parameters: "Each argument must be declared to be a void pointer to
> local memory."  So could you elaborate please where exactly do you plan to
> get the optimal alignment from?
>
>   Sam: The block function is passed to the builtin. The argument of the
> block function has the proper data type instead of void* type. Clang can
> deduce the alignment of the pointee of the kernel argument from the block
> function type.
>
>
> Thanks,
> Anastasia
> ------------------------------
>
> *From:* Liu, Yaxun (Sam) <Yaxun.Liu at amd.com>
> *Sent:* 15 December 2017 19:08
> *To:* Anastasia Stulova; cfe-dev (cfe-dev at lists.llvm.org); Bader, Alexey (
> alexey.bader at intel.com)
> *Cc:* Sumner, Brian; nd
> *Subject:* RE: [RFC][OpenCL] Pass alignment of arguments in local addr
> space for device-side enqueued kernel to __enqueue_kernel functions
>
>
>
> Spec reference:
>
>
>
> OpenCL v2.0 s6.1.5
>
> The OpenCL compiler is responsible for aligning data items to the
> appropriate alignment as required by the data type. For arguments to a
> __kernel function declared to be a pointer to a data type, the OpenCL
> compiler can assume that the pointee is always appropriately aligned as
> required by the data type. The behavior of an unaligned load or store is
> undefined, except for the vloadn, vload_halfn, vstoren, and vstore_halfn
> functions defined in section 6.13.7.
>
>
>
> s6.2.5
>
> Casting a pointer to a new type represents an unchecked assertion that the
> address is correctly aligned.
>
>
>
> The C Standard, 6.3.2.3, paragraph 7 [ISO/IEC 9899:2011], states
>
>
>
> A pointer to an object or incomplete type may be converted to a pointer to
> a different object or incomplete type. If the resulting pointer is not
> correctly aligned for the referenced type, the behavior is undefined.
>
>
>
> For example, if a block kernel has argument local int4*. Its alignment
> should be 16 bytes. Passing a pointer aligned to 1 byte may result in
> undefined behavior. Most hardware can still load from unaligned memory
> but will incur a performance hit. If the runtime wants to avoid the
> performance hit, it has to allocate the buffer at the maximum possible
> alignment, e.g. 32 bytes, which will result in wasted memory.
>
>
>
> Sam
>
>
>
> *From:* Anastasia Stulova [mailto:Anastasia.Stulova at arm.com
> <Anastasia.Stulova at arm.com>]
> *Sent:* Friday, December 15, 2017 10:40 AM
> *To:* Liu, Yaxun (Sam) <Yaxun.Liu at amd.com>; cfe-dev (
> cfe-dev at lists.llvm.org) <cfe-dev at lists.llvm.org>; Bader, Alexey (
> alexey.bader at intel.com) <alexey.bader at intel.com>
> *Cc:* Sumner, Brian <Brian.Sumner at amd.com>; nd <nd at arm.com>
> *Subject:* Re: [RFC][OpenCL] Pass alignment of arguments in local addr
> space for device-side enqueued kernel to __enqueue_kernel functions
>
>
>
>
>
> > OpenCL spec requires that a pointer should be aligned to at least the
> pointee type.
>
>
>
> So a pointer to int16 would be 64 byte aligned? Seems strange though. Can
> you give me the spec reference?
>
> > Otherwise, __enqueue_kernel has to either allocate unaligned local
> buffer, which degrades performance, or allocates local buffer with extra
> alignment therefore wasted memory space.
>
> Can you explain in more details here, please.
>
> Cheers,
> Anastasia
> ------------------------------
>
> *From:* Liu, Yaxun (Sam) <Yaxun.Liu at amd.com>
> *Sent:* 01 December 2017 19:45
> *To:* Anastasia Stulova; cfe-dev (cfe-dev at lists.llvm.org); Bader, Alexey (
> alexey.bader at intel.com)
> *Cc:* Sumner, Brian
> *Subject:* [RFC][OpenCL] Pass alignment of arguments in local addr space
> for device-side enqueued kernel to __enqueue_kernel functions
>
>
>
> Hi,
>
>
>
> OpenCL spec requires that a pointer should be aligned to at least the
> pointee type. Therefore, if a device-side enqueued kernel has a local int*
> argument, it should be aligned to 4 bytes.
>
>
>
> Since these buffers in local addr space are allocated by __enqueue_kernel,
> it needs to know the alignment of these buffers, not just their sizes.
>
>
>
> Although such information is not passed to the original OpenCL builtin
> function enqueue_kernel, it can be obtained by checking the prototype of
> the block invoke function at compile time.
>
>
>
> I would like to create a patch to pass this information to
> __enqueue_kernel. Otherwise, __enqueue_kernel has to either allocate an
> unaligned local buffer, which degrades performance, or allocate a local
> buffer with extra alignment, which wastes memory space.
>
>
>
> Any comments?
>
>
>
> Thanks.
>
>
>
> Sam
>
>