[PATCH] D112466: [NVPTX] Drop memory references of LDG/LDU
Artem Belevich via Phabricator via llvm-commits
llvm-commits at lists.llvm.org
Wed Nov 3 10:23:15 PDT 2021
tra added a comment.
In D112466#3105940 <https://reviews.llvm.org/D112466#3105940>, @asavonic wrote:
> In D112466#3103803 <https://reviews.llvm.org/D112466#3103803>, @tra wrote:
>
>> While `ldu` does indeed specify that it loads from read-only memory, I do not think we can treat `ld.global.nc` the same way.
>> PTX spec says `Load register variable d from the location specified by the source address operand a in the global state space, and optionally cache in non-coherent texture cache. Since the cache is non-coherent, the data should be read-only within the kernel's process.`
>>
>> The way I read it -- it's a regular load that bypasses cache.
>
> I'm not sure, but I think the spec says that the load is cached:
> `Load [...] from the location [...] in the global state space, and optionally cache in non-coherent texture cache.`
> The problem is that the cache is non-coherent and optional.
`ld.global.nc` is an odd instruction.
The `optionally cache in texture cache` part applies only to the newer variants of the instruction, which specify caching hints:
ld.global.nc{.level::eviction_priority}{.level::cache_hint}.type d, [a]{, cache-policy};
ld.global.nc{.level::eviction_priority}{.level::cache_hint}.vec.type d, [a]{, cache-policy};
...
Support for .level::eviction_priority and .level::cache_hint qualifiers introduced in PTX ISA version 7.4.
Support for .level::eviction_priority qualifier requires sm_70 or higher.
Support for .level::cache_hint qualifier requires sm_80 or higher.
I think the optionally-caching variants may have to be treated as separate instructions; functionality-wise those would be closer to `ldu`.
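For concreteness, the plain (un-hinted) form is what the CUDA `__ldg()` intrinsic normally compiles to. A minimal sketch, assuming sm_35+ and made-up kernel/parameter names:

  __global__ void scale(const float* __restrict__ in, float* __restrict__ out,
                        float k, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
      out[i] = k * __ldg(&in[i]);  // the __ldg() read is emitted as ld.global.nc (LDG)
  }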
>> Unlike `ldu`, it does not specify that the data is read-only. While read-only data will make it work correctly, it's not the only valid use case.
>> E.g. data may be changed from the host, and two subsequent `ld.global.nc` loads will return different values.
>
> You mean something like this, right?
>
> ld.global.nc r1, [a]
> st.global [a], r2 ; may be updated from host?
I mean that the host could do `cudaMemcpy(host->device, a, host_ptr, N)`.
> ld.global.nc r3, [a]
>
>
> I think the result is undefined: r1 and r3 can be equal if the first instruction cached the result. Otherwise, since the cache is optional, r2 and r3 can be equal.
`ld.global.nc` does no caching. If the host copies something into `a` between the two loads, r1 and r3 will be different.
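In API terms, that scenario would be something like the following host-side sequence (hypothetical kernel and buffer names, sketch only):

  kernel1<<<blocks, threads>>>(a);   // reads `a` via ld.global.nc
  cudaDeviceSynchronize();
  cudaMemcpy(a, host_ptr, N * sizeof(float), cudaMemcpyHostToDevice);  // host overwrites `a`
  kernel2<<<blocks, threads>>>(a);   // reads `a` again and sees the new data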
> However, what happens for this code?
>
> st.global [a], r4
> ld.global.nc r5, [a]
>
> Provided that `a` is not cached before the store, `ld.global.nc` should always load the stored value, right?
> If this is correct, then `ld.global.nc` can behave like a "load" in some cases, and therefore should have `mayLoad` flag.
The result of this code is undefined. In general, writes are posted, and the data may not make it to memory by the time `ld.global.nc` attempts to read it.
According to the PTX docs, the memory consistency model does not apply to `ld.global.nc`: https://docs.nvidia.com/cuda/parallel-thread-execution/index.html#scope-and-applicability
  "The memory consistency model does not apply to texture (including ld.global.nc) and surface accesses."
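In CUDA terms, the store-then-load pattern above would look roughly like this (a sketch only; the names are made up). Per the quoted rule, nothing orders the two accesses:

  __global__ void store_then_ldg(float* data, float* out) {
    int i = threadIdx.x;
    data[i] = 1.0f;            // st.global
    out[i] = __ldg(&data[i]);  // ld.global.nc -- not guaranteed to observe the store
  }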
My example of host/device accesses may be a bad one. It would affect regular loads/stores just as much, and right now we generally assume that only the GPU can change memory.
If that's the case, then `ld.global.nc` should not have `mayLoad` -- we can't guarantee consistency with any store ops anyway, so there's no point in affecting their order.
Repository:
rG LLVM Github Monorepo
CHANGES SINCE LAST ACTION
https://reviews.llvm.org/D112466/new/
https://reviews.llvm.org/D112466