[llvm] [LowerMemIntrinsics] Factor control flow generation out of the memcpy lowering (PR #169039)

Mon Dec 8 00:08:35 PST 2025

ritter-x2a wrote:

> Should this be enabled at O0 at all? We see code size/runtime increase at O0 with the debug info enabled because of this patch (about 3x runtime perf regressions, code size bloat about 1.5x)

@alexey-bataev That is surprising, thanks for checking!
This particular patch should have been mostly NFC, except for (1) more specific (and therefore longer) basic block names and (2) moving one basic block of the lowering, which previously sat at an unexpected position, to the expected position.
I suppose the basic block names might end up in the debug info in some form and thereby cause the regression you are seeing. Do you have an example translation unit and compile command that show this code size bloat and that you could share?
(Just to be sure we're talking about the same thing: by "runtime perf", you mean the performance of the compiled program, not of the compiler, right?)

Regarding whether it should run at -O0:
- The lowering of memory intrinsics into loops in general needs to run (for targets where library implementations are not available, like AMDGPU).
- I can see that we might want a separate path that only inserts a byte-wise (or maybe word-wise) copy loop for -O0 and -Os. I'm not sure whether that's better solved for all targets in LowerMemIntrinsics.cpp or in the target's TargetTransformInfo::getMemcpyLoopLoweringType; the latter might be more flexible, e.g., to choose a word-wise lowering instead of a byte-wise one for -Os if the difference in encoding size is minor compared to the performance benefit.
  However, as I said above, this patch shouldn't change the access type used for the lowering, so this should be orthogonal to the regression you're seeing.
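
To make the trade-off between the two loop shapes concrete, here is a rough C++ sketch. It is only an illustration, not the IR that the lowering actually emits: the function names `copy_bytewise` and `copy_wide` and the 4-byte access width are made up for this example, and a fixed-size `std::memcpy` stands in for a single wide load or store.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>

// Byte-wise loop: a single small loop, one byte copied per iteration.
void copy_bytewise(unsigned char *dst, const unsigned char *src,
                   std::size_t n) {
  for (std::size_t i = 0; i < n; ++i)
    dst[i] = src[i];
}

// Wide loop plus residual loop: more code (two loops and the computation
// of the split point), but fewer iterations for large n.
void copy_wide(unsigned char *dst, const unsigned char *src, std::size_t n) {
  // Hypothetical access width; a real target would pick its own.
  constexpr std::size_t W = sizeof(std::uint32_t);
  const std::size_t main_bytes = n - (n % W);

  std::size_t i = 0;
  // Main loop: copies W bytes per iteration.
  for (; i < main_bytes; i += W) {
    std::uint32_t tmp;
    std::memcpy(&tmp, src + i, W);  // stands in for one W-byte load
    std::memcpy(dst + i, &tmp, W);  // stands in for one W-byte store
  }
  // Residual loop: copies the remaining n % W bytes one at a time.
  for (; i < n; ++i)
    dst[i] = src[i];
}
```

Both functions copy the same bytes; the second one trades additional control flow and code size for fewer loop iterations, which is the trade-off an -O0/-Os-specific path would have to weigh.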

https://github.com/llvm/llvm-project/pull/169039