[llvm-dev] Element atomic vector stores: just do it?

Fri Aug 6 03:16:35 PDT 2021

Hi everyone,

Recently I started working on optimizing the @llvm.memcpy.element.unordered.atomic intrinsics to make them available and widely usable. Currently, LLVM provides a LangRef description for them, but no baked-in lowering. To change this situation, some steps need to be taken.

The obvious way to lower @llvm.memcpy.element.unordered.atomic is something like

for (i = 0; i < len; i++) {
  v = load atomic src[i]
  store atomic v, dest[i]
}
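
For readers who prefer something compilable, here is a minimal C++ sketch of the same element-wise loop. It is only an illustration, not how LLVM lowers the intrinsic: the name element_atomic_copy and the 32-bit element type are assumptions, and C++ has no "unordered" ordering, so memory_order_relaxed (which corresponds to LLVM's slightly stronger "monotonic") stands in for it.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical helper: copies `len` 32-bit elements one at a time.
// Each element is read and written with a single atomic access, so no
// individual element can be observed torn. There is no ordering between
// elements, and the copy as a whole is not atomic.
// Assumption: memory_order_relaxed stands in for LLVM's "unordered".
void element_atomic_copy(std::atomic<std::uint32_t>* dest,
                         const std::atomic<std::uint32_t>* src,
                         std::size_t len) {
  assert(len == 0 || (dest != nullptr && src != nullptr));
  for (std::size_t i = 0; i < len; ++i) {
    std::uint32_t v = src[i].load(std::memory_order_relaxed);
    dest[i].store(v, std::memory_order_relaxed);
  }
}
```

One load and one store per element is exactly what makes this form slow compared to a vectorized copy.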

But this code is very slow compared to efficient implementations of regular memcpy. What we would really want is

for (i = 0; i < len; i += stride) {
  vector_v = load element atomic <stride x i32>, src[i]
  store element atomic vector_v, dest[i]
}

However, currently there is no way to express this concept in LLVM. What we can do is

for (i = 0; i < len; i += stride) {
  vector_v = load <stride x i32>, src[i]
  store vector_v, dest[i]
}

When the platform supports the chosen vector size, we can hope (but only hope!) that this will be lowered into the corresponding vector loads/stores, such as xmm/ymm loads/stores on X86. But:

  *   There is no guarantee of that. Some IR pass may, in theory, break up the vectors as it pleases, because nothing demands atomicity.
     *   I don't think any pass does this in practice, but they have the right to.
  *   Even if it is lowered into xmm/ymm operations or something similar, there is no guarantee that xmm or ymm loads and stores are atomic. So even if this code is lowered into the corresponding ymm stores, the whole store might not be atomic (especially if it crosses a cache line boundary).
  *   At the codegen level, a ymm load/store may be torn. For example, I found this pass, which says:
https://github.com/llvm-mirror/llvm/blob/master/lib/Target/X86/X86AvoidStoreForwardingBlocks.cpp
// The pass currently only handles cases where memcpy is lowered to
// XMM/YMM registers, it tries to break the memcpy into smaller copies.
// breaking the memcpy should be possible since there is no atomicity
// guarantee for loads and stores to XMM/YMM.
So breaking up non-atomic xmm/ymm loads is a real thing that should be accounted for.

However, there is some good news. The specification of @llvm.memcpy.element.unordered.atomic says:

> For each of the input pointers align parameter attribute must be specified. It must be a power of two no less than the element_size. Caller guarantees that both the source and destination pointers are aligned to that boundary.

So if we somehow enforce lowering of the vector stores into hardware-supported operations and prohibit other passes from tearing them apart, we'll have ymm loads aligned to the size of some basic type (say i32). It is widely known that on X86, although xmm/ymm stores are not atomic, they don't tear words that are aligned to the word width (please correct me if that's not true!). So by just enforcing straightforward lowering of such loads into xmm/ymm loads, and prohibiting all passes (including codegen) from touching them, we should get the desired atomic behavior on X86.
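
To make the alignment precondition concrete, here is a small C++ sketch of the check implied by the quoted text. The helper names (valid_element_align, is_aligned, check_copy_preconditions) are made up for this example and are not LLVM APIs; it only shows the condition that would have to hold before relying on element-aligned vector accesses.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical helper: true if `align` is a power of two that is no less
// than `element_size`, as the quoted LangRef text requires.
constexpr bool valid_element_align(std::size_t align, std::size_t element_size) {
  return align != 0 && (align & (align - 1)) == 0 && align >= element_size;
}

// Hypothetical helper: true if `p` is aligned to `align` bytes.
// `align` is assumed to be a power of two.
inline bool is_aligned(const void* p, std::size_t align) {
  return (reinterpret_cast<std::uintptr_t>(p) & (align - 1)) == 0;
}

// Hypothetical check: asserts the guarantees the caller is supposed to give
// for both the source and the destination pointer.
inline void check_copy_preconditions(const void* dest, const void* src,
                                     std::size_t align,
                                     std::size_t element_size) {
  assert(valid_element_align(align, element_size));
  assert(is_aligned(dest, align));
  assert(is_aligned(src, align));
}
```

With element_size = 4, an alignment of 4, 8 or 16 passes, while 2 (too small) or 6 (not a power of two) does not.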

This might not be true for other platforms, but we should start somewhere.

Proposition
We could add another flag to load/store instructions, similar to atomic, called element-atomic, which only matters for vector loads and stores. We can guarantee that these loads only survive until codegen on platforms that support them. We can also go and update all passes that potentially tear vectors (I don't think there are many) and prohibit them from touching these loads and stores. And we'll also need an assert (on X86) that the pointers are aligned properly.

It doesn't look like a very hard task (the only hard part is finding all such places), but maybe there are some pitfalls I don't know about.

Please share your opinion on how best to do this and what problems you anticipate.

Thanks,
Max
