[Openmp-commits] [PATCH] D98832: [libomptarget] Tune the number of teams and threads for kernel launch.

Dhruva Chakrabarti via Phabricator via Openmp-commits openmp-commits at lists.llvm.org
Fri Mar 19 12:38:00 PDT 2021

dhruvachak added a comment.

In D98832#2637285 <https://reviews.llvm.org/D98832#2637285>, @JonChesterfield wrote:

> In D98832#2635305 <https://reviews.llvm.org/D98832#2635305>, @dhruvachak wrote:
>> ...
>> Agreed. However, I don't see LDS usage in the metadata table in the image. Is it present there?
> Yes, see https://llvm.org/docs/AMDGPUUsage.html for the list of what we can expect. What may not be obvious is that the metadata calls it ".group_segment_fixed_size". I don't know the origin of the terminology, maybe OpenCL?
>> In theory, a very high sgpr count can limit the number of available workgroups if that's not factored in for determining the number of threads. But in practice, VGPRs tend to be the primary limiting factor. So perhaps we can start with using VGPRs for this purpose and have experience guide us in the future.
> If I understand correctly, occupancy rules all look something like (resource available / resource used) == number simultaneous, where one of the resources tends to be limiting. Offhand, I think that's VGPR, SGPR, LDS (group segment). I think there's also an architecture-dependent upper bound on how many things can run at once even if they use very little of those, maybe 8 for gfx9 and 16 for gfx10.
> If that's right, perhaps the calculation should look something like:
>   uint vgpr_occupancy = vgpr_available / vgpr_used;
>   uint sgpr_occupancy = sgpr_available / sgpr_used;
>   uint lds_occupancy = lds_available / lds_used;
>   uint limiting_occupancy = min(vgpr_occupancy, sgpr_occupancy, lds_occupancy);
> and then we derive threadsPerGroup from that occupancy and the various other considerations.

Thanks for the pointer to the group segment. Yes, in general, my idea is similar to what you outlined above. However, note that SGPRs and LDS are at different granularities compared to VGPRs. VGPRs are per-thread, SGPRs are shared within a wavefront, and LDS is shared within a workgroup. So while VGPRs can be used to limit the number of threads, perhaps SGPRs and LDS can be used to limit the number of teams.
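
As a rough sketch of that split (not the code in this patch), the bounds could be computed separately per granularity. Every name below (KernelResources, DeviceResources, fitCount, computeBounds) is hypothetical, and the limits are placeholder parameters rather than real hardware values. Each bound is taken as available / used, capped by an architecture-dependent maximum.

```cpp
#include <algorithm>
#include <cstdint>

// Hypothetical per-kernel usage, e.g. as read from the code object metadata.
struct KernelResources {
  uint32_t VgprPerThread; // VGPRs are per-thread
  uint32_t SgprPerWave;   // SGPRs are shared within a wavefront
  uint32_t LdsPerGroup;   // LDS bytes per workgroup (.group_segment_fixed_size)
};

// Hypothetical device budgets; real values depend on the architecture.
struct DeviceResources {
  uint32_t VgprAvailable;
  uint32_t SgprAvailable;
  uint32_t LdsAvailable;
  uint32_t MaxWaves;  // assumed cap on simultaneous waves
  uint32_t MaxGroups; // assumed cap on simultaneous workgroups
};

struct LaunchBounds {
  uint32_t WavesForThreads; // bound used to limit the number of threads
  uint32_t GroupsForTeams;  // bound used to limit the number of teams
};

// How many users of a resource fit: available / used, capped at Cap.
// A kernel that does not use the resource is limited only by Cap.
static uint32_t fitCount(uint32_t Available, uint32_t Used, uint32_t Cap) {
  return Used == 0 ? Cap : std::min(Available / Used, Cap);
}

LaunchBounds computeBounds(const KernelResources &K, const DeviceResources &D) {
  // VGPR usage constrains threads.
  uint32_t ByVgpr = fitCount(D.VgprAvailable, K.VgprPerThread, D.MaxWaves);
  // SGPR and LDS usage are treated here as constraining teams.
  uint32_t BySgpr = fitCount(D.SgprAvailable, K.SgprPerWave, D.MaxGroups);
  uint32_t ByLds = fitCount(D.LdsAvailable, K.LdsPerGroup, D.MaxGroups);
  return {ByVgpr, std::min(BySgpr, ByLds)};
}
```

How these bounds would then map onto the actual teams/threads chosen at launch is exactly the part that needs experimentation.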

Let me split up this patch further. I would like to land the default num_teams change sooner rather than later since that's a simple change and has shown improved performance. So let me separate that out. Incorporating SGPRs/LDS to constrain teams/threads will need more experimentation.


