[PATCH] D140089: [MemCpyOpt] Add a stack-move optimization to opportunistically merge allocas together, disabled by default.

Thu Dec 15 02:01:52 PST 2022

pcwalton created this revision.
Herald added subscribers: JDevlieghere, hiraditya.
Herald added a project: All.
pcwalton requested review of this revision.
Herald added a project: LLVM.
Herald added a subscriber: llvm-commits.

This patch adds a new feature to the memcpy optimizer known as the stack-move
optimization, intended primarily for the Rust language. It detects the pattern
whereby memory is copied from one stack slot to another stack slot in such a
way that the destination and source are neither captured nor simultaneously
live. In such cases, it optimizes the pattern by merging the two allocas into
one and deleting the memcpy.

Without any changes to the frontend, this optimization eliminates over 17% of
the memcpys between stack slots in the self-hosted Rust compiler. The number
rises to approximately 42% if the Rust frontend is patched to apply nocapture
to all pointer-valued arguments. That patch is of course an incorrect change,
but a Rust compiler compiled in this way works, which is evidence that further
improvements to the Rust frontend to deduce nocapture in more cases can soundly
push the number of memcpys that this optimization eliminates toward the higher
end of that range.

In order to determine that the optimization is safe to apply, this patch
performs a coarse-grained conservative liveness analysis of the two stack slots
on the basic block level. This may appear to be slow, but measurement has
demonstrated the compile-time overhead of this optimization to be approximately
0.07%, which seems well worth it for the large amount of stack traffic that it
eliminates in typical Rust code. Nevertheless, in order to reduce the compile
time impact of this optimization, especially for other languages in which it
isn't expected to help as much, it's disabled by default and has a
frontend-configurable cap on the number of basic blocks that it's allowed to
examine. Although the optimization should be sound for any language according
to the semantics of LLVM IR, only the Rust compiler is expected to enable it by
default.

The liveness analysis is built on top of the CaptureTracking framework so that
the pointer use analysis that capture tracking performs can be reused to
calculate live ranges. The actual analysis is the efficient
"variable-at-a-time" version of liveness analysis, which has a time complexity
of O(number of variables * number of instructions) in the worst case. Since we
only have two variables in this case, and we work on the basic block level
instead of on the instruction level (with the exception of the basic block
containing the memcpy itself), this reduces to O(number of basic blocks +
number of instructions in that one block). In practice, this seems to be
efficient enough to avoid a significant compile time regression.

Additionally, this optimization "shrink wraps" any lifetime markers around the
allocas to the nearest common dominator and postdominator blocks. For this, it
needs the postdominator tree. This doesn't seem to cause a noticeable
difference in compile times, because the postdominator tree is needed by dead
store elimination anyway, which typically follows memcpy optimization.
Nonetheless, to be maximally cautious regarding compile time, we only require
the postdominator tree if the optimization is enabled.

The following tests fail if modified to enable the stack-move optimization.
Note that this patch doesn't update these tests because the stack-move
optimization isn't enabled by default, so this patch as submitted doesn't break
them.

- test/Analysis/ScopedNoAliasAA/alias-scope-merging.ll: @test() fails because

the stack move optimization combines %dst and %src into one. This is a valid
optimization because they're never captured.

- test/Transforms/MemCpyOpt/callslot.ll: @write_src_between_call_and_memcpy()

fails because the stack move optimization combines %dest and %src into one.
This is a valid optimization since they're never captured.

- test/Transforms/MemCpyOpt/callslot_badaa.ll: @test() fails because the stack

move optimization combines %dst and %src into one. This is a valid optimization
since they're never captured.

- test/Other/new-pm(-lto|-thinlto)-defaults.ll: These tests fail because they

weren't expecting the new postdominator tree pass.

Repository:
  rG LLVM Github Monorepo

https://reviews.llvm.org/D140089

Files:
  llvm/include/llvm/Analysis/CaptureTracking.h
  llvm/include/llvm/Transforms/Scalar/MemCpyOptimizer.h
  llvm/lib/Analysis/CaptureTracking.cpp
  llvm/lib/Transforms/Scalar/MemCpyOptimizer.cpp
  llvm/test/Transforms/MemCpyOpt/stack-move.ll

-------------- next part --------------
A non-text attachment was scrubbed...
Name: D140089.483095.patch
Type: text/x-patch
Size: 76215 bytes
Desc: not available
URL: <http://lists.llvm.org/pipermail/llvm-commits/attachments/20221215/7b6a7182/attachment.bin>