[llvm] [SROA] Allow rewriting memcpy depending on tbaa.struct (PR #77597)

Wed Jan 10 06:34:12 PST 2024

bjope wrote:

> I'm somewhat skeptical about making SROA use TBAA metadata.

I'm not surprised and I'm also a bit skeptical. Yet, the comments in CGExprAgg that adds tbaa.struct on aggregate copies is indicating that tbaa.struct is added to aid expansion of memcpy into scalar memory ops. So it's weird if the we first add tbaa.struct metadata to aid SROA in doing this kind of optimizations and then aren't allowed to use it.

> 
> I think there may be an alternative fix here: Require that the new alloca type is always "correct" for all uses. This means reverting my change, and instead changing how the SliceTy is determined in: https://github.com/llvm/llvm-project/blob/db74d290b229efb64698996e8647e1cca264d84f/llvm/lib/Transforms/Scalar/SROA.cpp#L4695C1-L4695C1

I did look a lot at rewritePartition/findCommonType/getTypePartition. But if the frontend has emitted a memcpy in the first place then it is a bit hard to find a common type that isn't including the padding bits. As there could be a chain of memcpy so one would need to ensure that the memory being copied is initialized by a scalar store, and that no pointers into the memory escape (that all the final users are loads using the scalar type).

> 
> In particular, we would drop the "If not, can we find an appropriate subtype in the original allocated type?" part of the logic and adjust findCommonType() to work with non-byte-sized integers. I think that would both address the original issue while still producing the desired code in your case?
> 
> Though at a high-level, we should really just not ever emit loads and stored with non-byte-sized types in the first place. Clang's use of these for `_BitInt` violates the SysV ABI, so needs to be changed anyway. We probably can't outright forbid these in LLVM, but frontends really shouldn't be generating them.

Right now our target has builtins operating on { i40, i40 }. So basically i40 is a builtin 40-bit type, with a store size of 48 bits (which is three bytes given an address unit size of 16 bits).
We could perhaps describe a load as "load i48 followed by trunc to i40" and hopefully ISel can lower it to a single i40 load to a 40-bit register. But it is difficult to describe a store as "anyextend i40 to i48 followed by store i48", to let ISel figure out that it can use a 40-bit store operation to write the 48 bits.

This has (amazingly) worked well in the past. But now, after https://github.com/llvm/llvm-project/commit/54067c5fbe9fc13ab195cdddb8f17e18d72b5fe4, when SROA failed to eliminate the aggregate copy via memcpy it ended up as a complete mess as I got a feeling that for example the LoopVectorizer wasn't happy with memcpy:s remaining inside the loop after SROA. At least one of the larger regressions involved lack of vectorization.

One option that I haven't fully explored yet would be to change our builtin {i40, i40} type in some way. Such as making sure that the CGExprAgg isn't using a memcpy in the first place. But then it would still not work in general for user defined structs etc. And maybe I would need to change the ABI for those builtin types.

https://github.com/llvm/llvm-project/pull/77597