[LLVMdev] Memcpy expansion: InstCombine vs SelectionDAG

Sat Jan 11 06:13:42 PST 2014

Hi all,

We currently have code in InstCombine that tries to expand memcpy / memmove intrinsics that copy something that fits in a single register into a single load and store in the IR.  We then have other code in SelectionDAG that expands small memcpy intrinsics to sequences of loads and stores.  The InstCombine one is useful, as it exposes more optimisation opportunities, but unfortunately it is not semantically equivalent to the SelectionDAG expansion.

We're seeing this in our back end, where a 32-bit aligned 64-bit memcpy generates efficient code when expanded near the back end (2 loads, 2 stores), but quite suboptimal code when the load is a result of the InstCombine expansion.  In the second case, it becomes two loads, a shift and an or (to construct the 64-bit value) and then a 

This simple program shows the problem:

struct A {
	int a, b;
} global;

int mv(const struct A *b)
{
	global = *b;
	return 1;
}

Compiled with clang at -O1 or above, the function becomes:

define i32 @mv(%struct.A* nocapture readonly %b) #0 {
entry:
  %0 = bitcast %struct.A* %b to i64*
  %1 = load i64* %0, align 4
  store i64 %1, i64* bitcast (%struct.A* @global to i64*), align 4
  ret i32 1
}
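
To make the difference concrete, here is a rough C model of the two code shapes on a target with no cheap unaligned 64-bit access.  This is only a sketch: the function names are made up for illustration, and the way the merged value is split again on the store side is an assumption, not something taken from our back end.  A C compiler will of course fold both functions to the same thing; the point is only to show the extra work the merged form implies.

```c
#include <stdint.h>

/* Shape 1: the copy is split at its natural 32-bit alignment,
 * giving two loads and two stores. */
void copy_split(uint32_t dst[2], const uint32_t src[2])
{
    uint32_t w0 = src[0];
    uint32_t w1 = src[1];
    dst[0] = w0;
    dst[1] = w1;
}

/* Shape 2: the two words are first merged into one 64-bit value
 * with a shift and an or, mirroring the i64 load in the IR.
 * Splitting it again for the store is an assumption made here. */
void copy_merged(uint32_t dst[2], const uint32_t src[2])
{
    uint64_t v = (uint64_t)src[0] | ((uint64_t)src[1] << 32);
    dst[0] = (uint32_t)v;
    dst[1] = (uint32_t)(v >> 32);
}
```

Both functions leave the same bytes in dst; the second just takes a detour through a 64-bit value that nothing else uses.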

The back end now pattern matches on the load, but doesn't know that the load is only ever used for a corresponding store.  I can think of several ways of fixing this, but the ideal one is probably for InstCombine to know whether unaligned loads and stores are expensive for this target and emit a short sequence of loads and stores at the desired alignment if so.  This code in InstCombine could actually benefit a lot from some target knowledge, as this transform may be applicable on some systems to vector or other types (for example, I bet there are a lot of C++ structures that are copied around a lot that would fit happily in an AVX register).  As a small extension, if the operation would be more efficient with a better alignment, and either the source or destination is on the stack, then it may make sense to increase the alignment of the alloca.
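
As a rough illustration of the kind of decision I mean, here is a C sketch.  Everything in it is hypothetical: struct target_info, pick_access_width and the MAX_PIECES limit are invented for this example and are not existing LLVM interfaces.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical target description; not an LLVM API. */
struct target_info {
    size_t max_reg_bytes;   /* widest integer register, a power of two */
    bool   fast_unaligned;  /* are misaligned accesses cheap? */
};

/* Arbitrary limit on how many load/store pairs count as "short". */
#define MAX_PIECES 4

static bool is_pow2(size_t x) { return x != 0 && (x & (x - 1)) == 0; }

/* Returns the width in bytes of each load/store to emit for a copy of
 * `size` bytes with known alignment `align`, or 0 to leave the memcpy
 * alone so that the back end can expand it. */
size_t pick_access_width(const struct target_info *t,
                         size_t size, size_t align)
{
    if (size == 0 || !is_pow2(align))
        return 0;

    /* One register-sized access: fine if it is naturally aligned or
     * the target does not mind misalignment. */
    if (is_pow2(size) && size <= t->max_reg_bytes &&
        (t->fast_unaligned || align >= size))
        return size;

    /* Otherwise try a short sequence of accesses no wider than the
     * known alignment (or a full register if misalignment is cheap). */
    size_t width = t->fast_unaligned ? t->max_reg_bytes
                 : (align < t->max_reg_bytes ? align : t->max_reg_bytes);
    if (size % width == 0 && size / width <= MAX_PIECES)
        return width;

    return 0;
}
```

With a target where misaligned accesses are expensive, the struct copy above (8 bytes, align 4) would get width 4, i.e. two loads and two stores, while a target with cheap unaligned accesses would keep the single 8-byte load and store.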

A somewhat hacky alternative that would address this case would be to have a codegen prepare pass that would try to turn load-store sequences back into memcpy / memmove intrinsics (although the aliasing information present in the decision to use memmove may be lost from the load / store sequence), which we can then expand more efficiently in SelectionDAG.
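
On the aliasing point: a single load followed by a single store reads the whole value before writing any of it, so it behaves like memmove even when source and destination overlap.  A pass going in the reverse direction could therefore always pick memmove safely, and memcpy only when it can show there is no overlap.  A small C sketch of that equivalence (copy8_load_store is a made-up name for illustration):

```c
#include <stdint.h>
#include <string.h>

/* Copy 8 bytes through a temporary, modelling one 64-bit load
 * followed by one 64-bit store.  The temporary never overlaps
 * src or dst, so each memcpy call here is well defined. */
void copy8_load_store(unsigned char *dst, const unsigned char *src)
{
    uint64_t tmp;
    memcpy(&tmp, src, sizeof tmp);  /* the "load"  */
    memcpy(dst, &tmp, sizeof tmp);  /* the "store" */
}
```

Calling this with overlapping ranges, for example copy8_load_store(buf + 4, buf), leaves the buffer in the same state as memmove(buf + 4, buf, 8).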

Do we have a sensible way of exposing this kind of target-specific information to optimisers other than the vector cost model?

David