[LLVMdev] Question about shouldMergeGEPs in InstructionCombining

Wed Feb 25 13:41:32 PST 2015

----- Original Message -----
> From: "Hal Finkel" <hfinkel at anl.gov>
> To: "Francois Pichet" <pichet2000 at gmail.com>
> Cc: "LLVM Developers Mailing List" <llvmdev at cs.uiuc.edu>, "chandlerc" <chandlerc at gmail.com>
> Sent: Tuesday, February 24, 2015 11:27:43 PM
> Subject: Re: [LLVMdev] Question about shouldMergeGEPs in InstructionCombining
> 
> w----- Original Message -----
> > From: "Francois Pichet" <pichet2000 at gmail.com>
> > To: "Hal Finkel" <hfinkel at anl.gov>
> > Cc: "LLVM Developers Mailing List" <llvmdev at cs.uiuc.edu>
> > Sent: Tuesday, February 24, 2015 6:17:22 AM
> > Subject: Re: [LLVMdev] Question about shouldMergeGEPs in
> > InstructionCombining
> > 
> > On Mon, Feb 23, 2015 at 2:17 PM, Hal Finkel < hfinkel at anl.gov >
> > wrote:
> > 
> > 
> > ----- Original Message -----
> > > From: "Francois Pichet" < pichet2000 at gmail.com >
> > > To: "LLVM Developers Mailing List" < llvmdev at cs.uiuc.edu >
> > > Sent: Sunday, February 22, 2015 5:34:11 PM
> > > Subject: [LLVMdev] Question about shouldMergeGEPs in
> > > InstructionCombining
> > > 
> > > Hello
> > > 
> > > I am not sure I understand the logic for merging GEPs in
> > > InstructionCombining.cpp:
> > > 
> > > static bool shouldMergeGEPs (GEPOperator &GEP, GEPOperator &Src)
> > > {
> > > // If this GEP has only 0 indices, it is the same pointer as
> > > // Src. If Src is not a trivial GEP too, don't combine
> > > // the indices.
> > > if (GEP.hasAllZeroIndices() && !Src.hasAllZeroIndices() &&
> > > !Src.hasOneUse())
> > > return false;
> > > return true;
> > > }
> > > 
> > > 
> > > 
> > > I have a case where
> > > GEP: %arrayidx7 = getelementptr inbounds i32* %arrayidx, i32
> > > %shl6
> > > Src: %arrayidx = getelementptr inbounds [4096 x i32]*
> > > @phasor_4096,
> > > i32 0, i32 %shl2
> > > 
> > > 
> > > GEP. hasAllZeroIndices() will return false and the merge will
> > > occur
> > > Why do we want to combine these 2 getelementptr?
> > > 
> > > 
> > > On my out of tree target, combining these 2 GetElementPtr create
> > > a
> > > performance regression because since GEP is in a loop (Src is out
> > > of
> > > loop), GEP will lower to a more complicated address for a
> > > subsequent
> > > load. (the complicated address needs to be calculated over and
> > > over
> > > in the loop)
> > > 
> > 
> > There are a couple of issues here. One, InstCombine's job is the
> > move
> > the IR toward a reasonable canonical form. It is doing that here,
> > and I think that's the right thing to do. However, the problem you
> > point out is a general one. Can you please explain why the
> > MachineLICM pass is not able to hoist the redundant parts of the
> > addressing computation out of the loop? We might also want to
> > adjust
> > for this in CodeGenPrep (CGP already has addressing-mode aware GEP
> > splitting logic, although not quite for this case).
> > 
> > 
> > 
> > 
> > Hi Hal,
> > 
> > 
> > MachineLICM is not able to hoist anything because the address mode
> > is
> > not loop invariant.
> > 
> > 
> > Here is a reduction of the code I am talking about.
> > 
> > 
> > 
> > extern const unsigned phasor[4096];
> > void test(unsigned* out , unsigned step_size)
> > {
> > unsigned big_step_size = step_size<<2;
> > int *phasor_ptr_temp_1 = &phasor[big_step_size];
> > for (int i = 0 ; i < 1020 ; i+=4)
> > out[i] = phasor_ptr_temp_1[i<<step_size];
> > }
> > 
> > 
> > I am getting slightly better code on my target (Octasic's Opus) if
> > I
> > return false for shouldMergeGEPs.
> > I just tried with ppc64 and x86-64 and I am also getting better
> > code
> > without the GEP merging in InstructionCombining. I am not sure what
> > the solution is yet but I think we are too aggressive when merging
> > GEPs in InstructionCombining.
> > 
> > 
> > Here is the details for ppc64: http://pastie.org/9978022
> > 
> > 
> 
> First, thanks for posting this, it is quite useful. LLVM trunk's
> PowerPC backend does a slightly better job now, but for an
> irrelevant reason, so to stick with the code you generated, we have:
> 
> .LBB0_1:                                # %for.body
>                                         # =>This Inner Loop Header:
>                                         Depth=1
> 	slw 8, 5, 4
> 	ld 9, .LC1 at toc@l(7)
> 	addi 5, 5, 4
> 	add 8, 8, 6
> 	extsw 8, 8
> 	sldi 8, 8, 2
> 	lwzx 8, 9, 8
> 	addi 9, 3, 16
> 	stw 8, 0(3)
> 	mr 3, 9
> 	bdnz .LBB0_1
> 
> there are two things wrong here, first:
> 
> 	ld 9, .LC1 at toc@l(7)
> 
> this load is loop invariant, and should be hoisted (but was not).

The problem of the non-hoisted TOC load was a backend deficiency, now fixed in r230553. Thanks for providing this example. I'll now give some more thought to the GEP splitting issue.

 -Hal

> But
> more to your point, this:
> 
> 	add 8, 8, 6
> 
> which represents this in your C code:
> 
>   int *phasor_ptr_temp_1 = &phasor[big_step_size];
> 
> has remained inside the loop. As you point out, the programmer even
> hoisted it for us, and we sunk it into the loop ourselves. The
> question is really this: How should we fix this? Generically, I
> believe that using fewer GEPs makes a cleaner canonical form for the
> IR as it passes through the mid-level optimizers, and we should
> split the GEPs to hoist invariant parts out of loops later in the
> pipeline (in or near CodeGenPrep, LSR, etc.) where we normally do
> addressing-mode-sensitive transformations.
> 
>  -Hal
> 
> --
> Hal Finkel
> Assistant Computational Scientist
> Leadership Computing Facility
> Argonne National Laboratory
> 

-- 
Hal Finkel
Assistant Computational Scientist
Leadership Computing Facility
Argonne National Laboratory