[llvm] [DSE] Split memory intrinsics if they are dead in the middle (PR #75478)
Björn Pettersson via llvm-commits
llvm-commits at lists.llvm.org
Tue Sep 10 02:23:30 PDT 2024
bjope wrote:
I'm still a bit worried that this will result in lots of regressions, at least for my target.
I wonder if there perhaps should be some kind of TTI hook or a command line option to easily disable the whole feature (the latter would also make it easier to benchmark by comparing results with and without the optimization enabled). I can of course add that downstream, so unless it would be used by in-tree targets it might seem weird to have such a TTI hook. But to make sure that this lands softly, maybe the feature should be disabled completely in the first version, limiting any potential revert to flipping one big switch rather than having to revert all the code changes.
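Something like the following is what I mean by a big switch (a purely hypothetical sketch; the flag name, function name and placement are made up):
```cpp
// Hypothetical sketch for DeadStoreElimination.cpp; names are made up.
// With cl::init(false) the new splitting stays off by default and can be
// enabled (or disabled again) without reverting any of the code changes.
#include "llvm/Support/CommandLine.h"
using namespace llvm;

static cl::opt<bool> EnableMemIntrinsicSplitting(
    "dse-split-mem-intrinsics", cl::Hidden, cl::init(false),
    cl::desc("Allow DSE to split memcpy/memset when the middle is dead"));

static bool tryToSplitMemIntrinsic(/* MemIntrinsic &MI, ... */) {
  if (!EnableMemIntrinsicSplitting)
    return false; // keep today's behaviour: only shrink the ends
  // ... the new splitting logic would go here ...
  return true;
}
```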
These are the kinds of patterns I'm concerned about, seen when doing some test compiles with this enabled downstream:
The example below would copy 29 bytes instead of 31. But for our target I think it will cost more to do two cross-address-space memcpys than to copy 2 extra bytes in a single copy (both latency-wise and in terms of memory bus usage etc.).
```
call void @llvm.memcpy.p0.p2.i32(ptr %p, ptr addrspace(2) ..., i32 31, i1 false)
=>
%p2 = getelementptr inbounds i8, ptr %p, i16 6
call void @llvm.memcpy.p0.p2.i32(ptr %p2, ptr addrspace(2) ..., i32 25, i1 false)
call void @llvm.memcpy.p0.p2.i32(ptr %p, ptr addrspace(2) ..., i32 4, i1 false)
```
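For reference, this roughly corresponds to a source pattern like the following (a hypothetical C++ sketch, ignoring the address-space aspect of the real case): a full struct copy where a couple of bytes in the middle are overwritten before ever being read, so DSE considers them dead in the copy.
```cpp
#include <cstring>

// Hypothetical sketch (not the actual downstream test case): a 31-byte copy
// where bytes [4,6) are overwritten before ever being read.
struct S {
  char head[4];  // bytes [0,4)  - kept from the copy
  char mid[2];   // bytes [4,6)  - overwritten below, so dead in the copy
  char tail[25]; // bytes [6,31) - kept from the copy
};

void f(S *dst, const S *src) {
  std::memcpy(dst, src, sizeof(S)); // the original 31-byte memcpy
  dst->mid[0] = 0;
  dst->mid[1] = 0;                  // these stores make the middle dead
}
```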
The 144 byte memcpy in the example below is split into 123+20, so we only save a single byte. Our target has an optimized memcpy implementation that copies several bytes per iteration in the inner loop (regardless of alignment), so splitting it up just results in two loops whose iteration counts aren't multiples of 8 or 16 etc. (123 = 15x8 + 3 and 20 = 2x8 + 4). That will be slower than just copying 18x8 bytes in a single memcpy.
```
call void @llvm.memcpy.p0.p0.i32(ptr ..., ptr ..., i32 144, i1 false)
=>
call void @llvm.memcpy.p0.p0.i32(ptr ..., ptr ..., i32 123, i1 false)
call void @llvm.memcpy.p0.p0.i32(ptr ..., ptr ..., i32 20, i1 false)
```
For the example below, with the memset being split, we used to get three 8 byte stores. Now we get two 8 byte stores + one 4 byte store + one 1 byte store, plus extra pointer fiddling and increased register pressure. So even if we memset three bytes less, the emitted asm looks much worse.
```
call void @llvm.memset.p0.i16.i32(ptr noundef nonnull align 1 dereferenceable(24) %c, i16 0, i32 24, i1 false)
=>
%rear2 = getelementptr inbounds i8, ptr %c, i16 4
call void @llvm.memset.p0.i16.i32(ptr noundef nonnull align 1 dereferenceable(20) %rear2, i16 0, i32 20, i1 false)
call void @llvm.memset.p0.i16.i32(ptr noundef nonnull align 1 dereferenceable(1) %ci, i16 0, i32 1, i1 false)
```
It is hard for a middle-end pass to know details about the target architecture, and in general we want it to canonicalize to something target independent. But I think it also might be hard to fuse these memcpys back together in the backend. So I'm a bit worried that even if the developer has decided at the C level to use a single memcpy (as a tuning), DSE might now split it into two memcpys even when that isn't profitable.
In the end I think we need more than the already existing getMaxMemIntrinsicInlineSizeThreshold (which we currently have set to UINT64_MAX for our target). So that might be another reason to land this "softly", by letting backend people add more tweaks to tune this and enable it in a more controlled way.
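For reference, this is roughly how we use that hook downstream today (a sketch; MyTargetTTIImpl is a placeholder for our downstream TTI implementation):
```cpp
#include <cstdint>

// Sketch of our downstream override (class name is a placeholder); we
// currently return UINT64_MAX from this hook for our target.
class MyTargetTTIImpl /* : public BasicTTIImplBase<MyTargetTTIImpl> */ {
public:
  uint64_t getMaxMemIntrinsicInlineSizeThreshold() const {
    return UINT64_MAX;
  }
};
```
My point is that a single size threshold like this can't express things like "don't split a copy into two uneven pieces" or "don't create an extra cross-address-space transfer".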
Maybe even this first implementation should include some more heuristics, such as not splitting a 144 byte memcpy into two "uneven" 123+20 byte pieces just to save copying a single byte? (The goal with https://github.com/llvm/llvm-project/issues/72113 was kind of to split up the memset/memcpy when we can avoid touching a larger chunk of memory twice.)
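Purely as an illustration of the direction (name and constants made up, not a concrete proposal), I'm thinking of a profitability check along these lines:
```cpp
#include <cstdint>

// Hypothetical profitability heuristic (made-up name and constants): only
// split when the dead middle region is a non-trivial fraction of the whole
// copy, so we don't turn 144 bytes into 123+20 just to save a single byte.
static bool isSplitProfitable(uint64_t OrigSize, uint64_t DeadMiddleBytes) {
  // Require a few dead bytes in absolute terms, and that the split avoids
  // re-touching at least ~1/8 of the original region.
  return DeadMiddleBytes >= 4 && DeadMiddleBytes * 8 >= OrigSize;
}
```
With something like that, the 31 -> 25+4 and 144 -> 123+20 splits above would not happen, while cases where a large chunk in the middle is dead would still be split.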
https://github.com/llvm/llvm-project/pull/75478