[PATCH] D120129: [NVPTX] Enhance vectorization of ld.param & st.param

Tue Mar 1 03:25:20 PST 2022

kovdan01 added a comment.

In D120129#3338254 <https://reviews.llvm.org/D120129#3338254>, @tra wrote:

> FYI. I've recently proposed to pass the values directly, instead of a byval pointer: https://discourse.llvm.org/t/nvptx-calling-convention-for-aggregate-arguments-passed-by-value/5881
> It needs additional changes to get lowering done correctly. I should have them ready in about a week.

Okay, please let me know when you have some results. In any case, IMO we should still be able to handle IR that uses byval pointers, if only because someone might use a frontend other than clang.

>> This change may be done if the function has private or internal linkage.
>
> I think we should be able to do that for all non-kernel functions if we're compiling without -fgpu-rdc. I think we do reduce the visibility of non-kernels in that case, but it would be good to make sure.

Okay, I'll investigate it.

>> If S is a multiple of 4 * A, let special alignment be 4 * A.
>> Else, if S is a multiple of 2 * A, let special alignment be 2 * A.
>> Else, let special alignment be A.
>
> I'm not sure that logic makes sense to me. E.g. if we have `[5 x i32]`, the optimal way to load it would be to align it to 16 and use `ld.v4` + `ld`. With your approach, `[4 x i32]` would be loaded as `ld.v4` and `[6 x i32]` as `3 * ld.v2`, but `[5 x i32]` would use `5 * ld`. If we can align 4- and 6-element arrays, I do not see why we would not be allowed to align a 5-element array, too; as far as the in-memory layout is concerned, it's equivalent to `struct { [4 x i32], i32 }`.
>
> I wonder if we can just always set the alignment to 16. A byval pointer for NVPTX is a fiction anyway, as we always copy the data when we actually lower those arguments.

I suppose you are correct and we can always set the alignment to 16. The reason I implemented such logic is that I tried to be as conservative as possible. For example, if we have two values of `[5 x i32]` with alignment 4, they might be placed together without any gaps. If we align them to 16, an additional 12-byte gap appears. If keeping this kind of "layout" stable is not important, I'll change the logic so that the alignment is always 16 in param space.

Repository:
  rG LLVM Github Monorepo

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D120129/new/

https://reviews.llvm.org/D120129