[LLVMdev] Question about shouldMergeGEPs in InstructionCombining

Tue Feb 24 21:27:43 PST 2015

w----- Original Message -----
> From: "Francois Pichet" <pichet2000 at gmail.com>
> To: "Hal Finkel" <hfinkel at anl.gov>
> Cc: "LLVM Developers Mailing List" <llvmdev at cs.uiuc.edu>
> Sent: Tuesday, February 24, 2015 6:17:22 AM
> Subject: Re: [LLVMdev] Question about shouldMergeGEPs in InstructionCombining
> 
> On Mon, Feb 23, 2015 at 2:17 PM, Hal Finkel < hfinkel at anl.gov >
> wrote:
> 
> 
> ----- Original Message -----
> > From: "Francois Pichet" < pichet2000 at gmail.com >
> > To: "LLVM Developers Mailing List" < llvmdev at cs.uiuc.edu >
> > Sent: Sunday, February 22, 2015 5:34:11 PM
> > Subject: [LLVMdev] Question about shouldMergeGEPs in
> > InstructionCombining
> > 
> > Hello
> > 
> > I am not sure I understand the logic for merging GEPs in
> > InstructionCombining.cpp:
> > 
> > static bool shouldMergeGEPs (GEPOperator &GEP, GEPOperator &Src) {
> > // If this GEP has only 0 indices, it is the same pointer as
> > // Src. If Src is not a trivial GEP too, don't combine
> > // the indices.
> > if (GEP.hasAllZeroIndices() && !Src.hasAllZeroIndices() &&
> > !Src.hasOneUse())
> > return false;
> > return true;
> > }
> > 
> > 
> > 
> > I have a case where
> > GEP: %arrayidx7 = getelementptr inbounds i32* %arrayidx, i32 %shl6
> > Src: %arrayidx = getelementptr inbounds [4096 x i32]* @phasor_4096,
> > i32 0, i32 %shl2
> > 
> > 
> > GEP. hasAllZeroIndices() will return false and the merge will occur
> > Why do we want to combine these 2 getelementptr?
> > 
> > 
> > On my out of tree target, combining these 2 GetElementPtr create a
> > performance regression because since GEP is in a loop (Src is out
> > of
> > loop), GEP will lower to a more complicated address for a
> > subsequent
> > load. (the complicated address needs to be calculated over and over
> > in the loop)
> > 
> 
> There are a couple of issues here. One, InstCombine's job is the move
> the IR toward a reasonable canonical form. It is doing that here,
> and I think that's the right thing to do. However, the problem you
> point out is a general one. Can you please explain why the
> MachineLICM pass is not able to hoist the redundant parts of the
> addressing computation out of the loop? We might also want to adjust
> for this in CodeGenPrep (CGP already has addressing-mode aware GEP
> splitting logic, although not quite for this case).
> 
> 
> 
> 
> Hi Hal,
> 
> 
> MachineLICM is not able to hoist anything because the address mode is
> not loop invariant.
> 
> 
> Here is a reduction of the code I am talking about.
> 
> 
> 
> extern const unsigned phasor[4096];
> void test(unsigned* out , unsigned step_size)
> {
> unsigned big_step_size = step_size<<2;
> int *phasor_ptr_temp_1 = &phasor[big_step_size];
> for (int i = 0 ; i < 1020 ; i+=4)
> out[i] = phasor_ptr_temp_1[i<<step_size];
> }
> 
> 
> I am getting slightly better code on my target (Octasic's Opus) if I
> return false for shouldMergeGEPs.
> I just tried with ppc64 and x86-64 and I am also getting better code
> without the GEP merging in InstructionCombining. I am not sure what
> the solution is yet but I think we are too aggressive when merging
> GEPs in InstructionCombining.
> 
> 
> Here is the details for ppc64: http://pastie.org/9978022
> 
> 

First, thanks for posting this, it is quite useful. LLVM trunk's PowerPC backend does a slightly better job now, but for an irrelevant reason, so to stick with the code you generated, we have:

.LBB0_1:                                # %for.body
                                        # =>This Inner Loop Header: Depth=1
	slw 8, 5, 4
	ld 9, .LC1 at toc@l(7)
	addi 5, 5, 4
	add 8, 8, 6
	extsw 8, 8
	sldi 8, 8, 2
	lwzx 8, 9, 8
	addi 9, 3, 16
	stw 8, 0(3)
	mr 3, 9
	bdnz .LBB0_1

there are two things wrong here, first:

	ld 9, .LC1 at toc@l(7)

this load is loop invariant, and should be hoisted (but was not). But more to your point, this:

	add 8, 8, 6

which represents this in your C code:

  int *phasor_ptr_temp_1 = &phasor[big_step_size];

has remained inside the loop. As you point out, the programmer even hoisted it for us, and we sunk it into the loop ourselves. The question is really this: How should we fix this? Generically, I believe that using fewer GEPs makes a cleaner canonical form for the IR as it passes through the mid-level optimizers, and we should split the GEPs to hoist invariant parts out of loops later in the pipeline (in or near CodeGenPrep, LSR, etc.) where we normally do addressing-mode-sensitive transformations.

 -Hal 

-- 
Hal Finkel
Assistant Computational Scientist
Leadership Computing Facility
Argonne National Laboratory