[LLVMdev] Memcpy expansion: InstCombine vs SelectionDAG
    Quentin Colombet 
    qcolombet at apple.com
       
    Tue Jan 14 12:42:15 PST 2014
    
    
  
Hi David,
On Jan 11, 2014, at 6:13 AM, David Chisnall <David.Chisnall at cl.cam.ac.uk> wrote:
> Hi all,
> 
> We currently have code in InstCombine that tries to expand memcpy / memmove intrinsics that are copying something that fits in a single register to a load and store in the IR.  We then have other code in SelectionDAG that expands small memcpy intrinsics to sequences of loads and stores.  The InstCombine one is useful, as it exposes more optimisation opportunities, but unfortunately it is not semantically equivalent to the SelectionDAG expansion.
Could you be more specific?
In particular, what is the problem?
> 
> We're seeing this in our back end where a 32-bit aligned 64-bit memcpy is generates efficient code when expanded near the back end (2 loads, 2 stores), but quite suboptimal code when the load is a result of InstCombine expansion.  In the second case, it becomes two loads, a shift and an or (to construct the 64-bit value) and then a 
We are missing the interesting bits here :).
What is next?
A 64-bit store?
> 
> This simple program shows the problem:
> 
> struct A {
> 	int a, b;
> } global;
> 
> int mv(const struct A *b)
> {
> 	global = *b;
> 	return 1;
> }
> 
> Compiled with clang at -O1 or above, the function becomes:
> 
> define i32 @mv(%struct.A* nocapture readonly %b) #0 {
> entry:
>  %0 = bitcast %struct.A* %b to i64*
>  %1 = load i64* %0, align 4
>  store i64 %1, i64* bitcast (%struct.A* @global to i64*), align 4
>  ret i32 1
> }
> 
> The back end now pattern matches on the load, but doesn't know that the load is only ever used for a corresponding store.
I am not sure I follow what you are saying.
In this example, the load (%1) is used only once and in the same basic block. You should have every information you need in SelectionDAG.
That said, as you may know, SelectionDAG is a bottom-up approach. Thus, store is matched before the load gets a change to be matched.
It is hard to give direction on how to fix that since we do not know anything about the target. In particular, is i64 a legal type or not?
Based on your problem description, it is not clear since we are missing what the 2 loads are feeding but it sounds like they are feeding a 64-bit store.
In short, what is the best lowering you would expect for your target?
1. 2 loads - 2 store?
2. A memcpy intrinsic?
For #1, I am wondering if we could extend the slicing that takes place during DAG combine (look for SliceUpLoad in DAGCombiner) to slice the store part too if that is the problem.
>  I can think of several ways of fixing this, but the ideal one is probably for InstCombine to know whether unaligned loads and stores are expensive for this target and emit a short sequence of loads and stores at the desired alignment if so.  This code in InstCombine could actually benefit a lot from some target knowledge, as this transform may be applicable on some systems to vector or other types (for example, I bet there are a lot of C++ structures that are copied around a lot that would fit happily in an AVX register).
As far as I can tell, InstCombine is target independent and its purpose is to produce a canonical form. In that case, memcpy/memmove of chunk that are less that 8 bytes are canonicalized to 64-bit load/store.
> As a small extension, if the operation can be more efficient if the alignment is better, and either the source or destination is on the stack, then it may make sense to increase the alignment of the alloca.  
> 
> A somewhat hacky alternative that would address this case would be to have a codegen prepare pass that would try to turn load-store sequences back into memcopy / memmove intrinsics (although the aliasing information present in the decision to use memmove may be lost from the load / store sequence), which we can then expand more efficiently in SelectionDAG.
> 
> Do we have a sensible way of exposing this kind of target-specific information to optimisers other than the vector cost model?
We usually use the TargetLowering class, you can look into the CodeGenPrepare pass :).
Hope that helps.
-Quentin
> 
> David
> 
> 
> _______________________________________________
> LLVM Developers mailing list
> LLVMdev at cs.uiuc.edu         http://llvm.cs.uiuc.edu
> http://lists.cs.uiuc.edu/mailman/listinfo/llvmdev
    
    
More information about the llvm-dev
mailing list