[llvm] [NVPTX] Improve copy avoidance during lowering. (PR #106423)

Wed Aug 28 16:18:00 PDT 2024

Artem-B wrote:

@fiigii :
> The overhead is similar to generic loads vs non-generic loads. Generic loads via cvta.param are slower because it requires runtime address space check and conversion.

I was under the impression that it's the conversion that requires address boundary checks and, possibly, some math to shift the address into the generic address space, but that the SASS-level `LD*` instructions themselves should not be affected much, modulo whatever timings those instructions inherently have in hardware. NVIDIA does not provide any public info on that. My default assumption is that accesses via pointers in different address spaces perform about the same, as long as they end up hitting the same underlying memory. We may see noticeable differences when `ld.param` translates into a load from an argument passed in a register, but AFAICT, that's not the case for kernel arguments.
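
For illustration only, here is a minimal CUDA sketch of the two access patterns being compared. The names (`Params`, `read_const`, `read_dynamic`, `read_via`) are hypothetical and not taken from the PR. Whether the dynamically indexed read is lowered to a local copy or to `cvta.param` plus generic loads depends on the compiler version and target, so treat the comments as assumptions to check against the generated PTX/SASS rather than as guaranteed behavior.

```cuda
#include <cuda_runtime.h>

// Hypothetical aggregate passed to the kernel by value (lives in param space).
struct Params {
  int data[8];
};

// Constant index: expected (assumption) to lower to a plain ld.param
// with a fixed offset.
__global__ void read_const(Params p, int *out) {
  *out = p.data[3];
}

// Runtime index: the compiler needs the address of `p`. Depending on the
// toolchain and target (assumption), it either copies the argument to local
// memory or converts the param address to generic and uses generic loads.
__global__ void read_dynamic(Params p, int idx, int *out) {
  *out = p.data[idx];
}

// Host helper: runs one of the kernels and returns the value it read,
// or -1 if the device allocation or copy-back fails.
int read_via(bool dynamic, int idx) {
  Params p;
  for (int i = 0; i < 8; ++i) p.data[i] = 10 * i;

  int *d_out = nullptr;
  if (cudaMalloc(&d_out, sizeof(int)) != cudaSuccess) return -1;

  if (dynamic)
    read_dynamic<<<1, 1>>>(p, idx, d_out);
  else
    read_const<<<1, 1>>>(p, d_out);

  int h_out = -1;
  if (cudaMemcpy(&h_out, d_out, sizeof(int), cudaMemcpyDeviceToHost) !=
      cudaSuccess)
    h_out = -1;
  cudaFree(d_out);
  return h_out;
}
```

Both kernels read the same element when `idx == 3`, so any timing difference would come from how the load is issued, not from what is loaded. Compiling with `nvcc -ptx` and dumping the binary with `cuobjdump -sass` is one way to see which instructions each variant actually produces.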

https://github.com/llvm/llvm-project/pull/106423