[PATCH] D112466: [NVPTX] Drop memory references of LDG/LDU

Artem Belevich via Phabricator via llvm-commits llvm-commits at lists.llvm.org
Wed Nov 3 10:23:15 PDT 2021


tra added a comment.

In D112466#3105940 <https://reviews.llvm.org/D112466#3105940>, @asavonic wrote:

> In D112466#3103803 <https://reviews.llvm.org/D112466#3103803>, @tra wrote:
>
>> While `ldu` does indeed specify that it loads from read-only memory, I do not think we can treat `ld.global.nc` the same way.
>> PTX spec says `Load register variable d from the location specified by the source address operand a in the global state space, and optionally cache in non-coherent texture cache. Since the cache is non-coherent, the data should be read-only within the kernel's process.`
>>
>> The way I read it -- it's a regular load that bypasses cache.
>
> I'm not sure, but I think the spec says that the load is cached:
> `Load [...] from the location [...] in the global state space, and optionally cache in non-coherent texture cache.`
> The problem is that the cache is non-coherent and optional.

`ld.global.nc` is an odd instruction.

The `optionally cache in texture cache` clause applies only to the new variants of the instruction that specify caching hints.

  ld.global.nc{.level::eviction_priority}{.level::cache_hint}.type      d, [a]{, cache-policy};
  ld.global.nc{.level::eviction_priority}{.level::cache_hint}.vec.type  d, [a]{, cache-policy};
  ...
  Support for .level::eviction_priority and .level::cache_hint qualifiers introduced in PTX ISA version 7.4.
  Support for .level::eviction_priority qualifier requires sm_70 or higher.
  Support for .level::cache_hint qualifier requires sm_80 or higher.

I think the optionally-caching variants may have to be treated as separate instructions; those would be closer to `ldu` functionality-wise.
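
For reference (my own illustration, not part of this patch): the usual source-level route to `ld.global.nc` is the `__ldg()` intrinsic on sm_35 and newer. A minimal CUDA sketch with made-up names:

  // Illustration only: __ldg() reads through the read-only (non-coherent)
  // cache path; the data must not be written for the lifetime of the kernel.
  __global__ void scale(float *out, const float *__restrict__ in, float k, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
      out[i] = k * __ldg(&in[i]);   // expected to lower to ld.global.nc
  }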

>> Unlike `ldu`, it does not specify that it's a read-only data. While read-only data will make it work correctly, it's not the only valid use case.
>> E.g. data may be changed from the host and two subsequent `ld.global.nc` will return different values.
>
> You mean something like this, right?
>
>   ld.global.nc r1, [a]
>   st.global [a], r2           ; may be updated from host?

I mean that the host could do `cudaMemcpy(a, host_ptr, N, cudaMemcpyHostToDevice)`.

>   ld.global.nc r3, [a]

> I think the result is undefined: r1 and r3 can be equal if the first instruction cached the result. Otherwise, since the cache is optional, r2 and r3 can be equal.

`ld.global.nc` does no caching. If the host copies something into `a` between the two loads, r1 and r3 will be different.
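
Concretely (again just an illustration, all names made up; built with something like `nvcc -arch=sm_35`), a host-side copy between two kernel launches is enough to make the two non-coherent loads observe different values:

  #include <cuda_runtime.h>
  #include <cstdio>

  __global__ void read_once(const float *__restrict__ p, float *out) {
    *out = __ldg(p);                      // expected to lower to ld.global.nc
  }

  int main() {
    float host_val = 1.0f, r1, r3;
    float *a, *out;
    cudaMalloc(&a, sizeof(float));
    cudaMalloc(&out, sizeof(float));
    cudaMemcpy(a, &host_val, sizeof(float), cudaMemcpyHostToDevice);

    read_once<<<1, 1>>>(a, out);          // first non-coherent load -> r1
    cudaMemcpy(&r1, out, sizeof(float), cudaMemcpyDeviceToHost);

    host_val = 2.0f;                      // host updates 'a' between the loads
    cudaMemcpy(a, &host_val, sizeof(float), cudaMemcpyHostToDevice);

    read_once<<<1, 1>>>(a, out);          // second non-coherent load -> r3
    cudaMemcpy(&r3, out, sizeof(float), cudaMemcpyDeviceToHost);

    printf("r1=%f r3=%f\n", r1, r3);      // r3 should see the host's update
    cudaFree(a);
    cudaFree(out);
    return 0;
  }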

> However, what happens for this code?
>
>   st.global [a], r4
>   ld.global.nc r5, [a]
>
> Provided that `a` is not cached before the store, `ld.global.nc` should always load the stored value, right?
> If this is correct, then `ld.global.nc` can behave like a "load" in some cases, and therefore should have `mayLoad` flag.

The result of this code is undefined. In general, writes are posted and the data may not make it to memory by the time `ld.global.nc` attempts to read it.
According to the PTX docs, the memory consistency model does not apply to `ld.global.nc`: https://docs.nvidia.com/cuda/parallel-thread-execution/index.html#scope-and-applicability

  The memory consistency model does not apply to texture (including ld.global.nc) and surface accesses.

My example of host/device accesses may be a bad one. It would equally affect regular loads/stores, too, and right now we generally assume that only the GPU can change memory.
If that's the case, then `ld.global.nc` should not have `mayLoad` -- we can't guarantee consistency with any store ops anyway, so there's no point in affecting their order.
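
In source terms (again a made-up illustration, not from the patch), the problematic pattern is a kernel that stores through a normal pointer and then reads the same location through the non-coherent path:

  // Per the argument above, nothing orders the __ldg() read after the plain
  // store, so there is no guarantee it observes 1.0f.
  __global__ void store_then_ldg(float *buf, float *out, int i) {
    buf[i] = 1.0f;             // st.global
    *out = __ldg(&buf[i]);     // ld.global.nc
  }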


Repository:
  rG LLVM Github Monorepo

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D112466/new/

https://reviews.llvm.org/D112466
