[LLVMdev] Question on Machine Combiner Pass

Fri Feb 6 12:12:43 PST 2015

----- Original Message -----
> From: "Ana Pazos" <apazos at codeaurora.org>
> To: "Hal Finkel" <hfinkel at anl.gov>, "Mandeep Singh Grang" <mgrang at codeaurora.org>
> Cc: llvmdev at cs.uiuc.edu
> Sent: Friday, February 6, 2015 12:00:21 PM
> Subject: RE: [LLVMdev] Question on Machine Combiner Pass
> 
> Hi Hal,
> 
> We came across many missed opportunities to generate madd
> instructions in
> AArch64.
> 
> In this particular test case, a mul instruction that feeds into a
> madd
> instruction as its accumulator should have a reduced latency. But
> when
> dumping MachineCombiner decisions we do not see the reduced latency
> value
> (should be 1).
> 
> We debugged a bit and noticed the difference in how NewRoot and Root
> latency
> and depth are computed. We were not sure they were leading to the
> same
> queries to the SchedMod. You can see for example that in one instance
> it
> queries with the machine instruction pointer, in the other with only
> the
> machine instruction opcode:
> 
> TSchedModel.computeInstrLatency(Root)
> TSchedModel.computeInstrLatency(NewRoot->getOpcode())

I think that the idea is that what is currently done computes a minimal modification to the existing analysis. Perhaps this is not true, however.

> 
> We made a local change that adds the new candidate instructions to
> the basic
> block (with different output virtual register for the NewRoot),  and
> then
> ran the trace analysis on the modified BB. This allowed us to use the
> same
> code for computing latency and depth of both NewRoot and Root. Still
> we did
> not get the reduced latency value for the mult.
> 
> We asked Dave Estes to take a look at our reduced test and he may
> have found
> a bug in SchedAlias,. He will bring it up to the community in a
> separate
> discussion. With his fix now we can see the reduced mult latency and
> madd
> being generated for the reduced test case. We are verifying the fix
> with
> other complex tests.

Okay, sounds good.

> 
> But we still are left with this question about MachineCombiner:
> Shouldn't
> the trace analysis run on a modified basic block that contains the
> new
> candidate instructions and NewRoot/Root latency and depth be computed
> with
> the same SchedMod apis? In our local change we rerun the trace
> analysis for
> each candidate instruction we add to the basic block, and remove them
> from
> the basic block if they are not profitable.

I assume this was not done because of compile-time concerns. Does this scale as (N^2)*(# candidates) in the size of the BB?

> 
> Thanks,
> Ana.
> 
> 
> 
> -----Original Message-----
> From: llvmdev-bounces at cs.uiuc.edu
> [mailto:llvmdev-bounces at cs.uiuc.edu] On
> Behalf Of Hal Finkel
> Sent: Wednesday, February 04, 2015 5:33 PM
> To: Mandeep Singh Grang
> Cc: llvmdev at cs.uiuc.edu
> Subject: Re: [LLVMdev] Question on Machine Combiner Pass
> 
> ----- Original Message -----
> > From: "Mandeep Singh Grang" <mgrang at codeaurora.org>
> > To: llvmdev at cs.uiuc.edu
> > Sent: Wednesday, February 4, 2015 12:58:27 PM
> > Subject: Re: [LLVMdev] Question on Machine Combiner Pass
> > 
> > 
> > 
> > 
> > 
> > Ping
> > 
> > From: Mandeep Singh Grang [mailto:mgrang at codeaurora.org]
> > Sent: Tuesday, February 03, 2015 4:34 PM
> > To: 'llvmdev at cs.uiuc.edu'
> > Cc: 'ghoflehner at apple.com'; 'apazos at codeaurora.org';
> > mgrang at codeaurora.org
> > Subject: Question on Machine Combiner Pass
> > 
> > 
> > 
> > Hi,
> > 
> > 
> > 
> > In the file lib/CodeGen/MachineCombiner.cpp I see that in the
> > function
> > MachineCombiner::preservesCriticalPathLen
> > we try to determine whether the new combined instruction lengthens
> > the
> > critical path or not.
> > 
> > 
> > 
> > In order to do this we compute the depth and latency for the
> > current
> > instruction (MUL+ADD) and the alternate instruction (MADD).
> > 
> > But we call two different set of APIs for the current and new
> > instructions:
> > 
> > 
> > 
> > For new instruction we use:
> > 
> > unsigned NewRootDepth = getDepth(InsInstrs, InstrIdxForVirtReg,
> > BlockTrace);
> > 
> > unsigned NewRootLatency = getLatency(Root, NewRoot, BlockTrace);
> > 
> > 
> > 
> > While for the current instruction we use:
> > 
> > unsigned RootDepth = BlockTrace.getInstrCycles(Root).Depth;
> > 
> > unsigned RootLatency = TSchedModel.computeInstrLatency(Root);
> > 
> 
> The BlockTrace comes from MachineTraceMetrics, which is an analysis
> pass,
> and thus might have its data pre-computed. This only strictly applies
> to the
> current instruction sequence. We need to use a different method to
> compute
> information for potential new instructions. Are you finding that
> these
> methods compute inconsistent results?
> 
>  -Hal
> 
> > 
> > 
> > This is related to the following commit:
> > 
> > commit e4fa341dde3c9521b7f11bd53ecdcbeb3f8fcbda
> > 
> > Author: Gerolf Hoflehner < ghoflehner at apple.com >
> > 
> > Date: Thu Aug 7 21:40:58 2014 +0000
> > 
> > MachineCombiner Pass for selecting faster instruction sequence on
> > AArch64
> > 
> > 
> > For this example code sequence:
> > 
> > %mul = mul nuw nsw i32 %conv2, %conv
> > 
> > %mul7 = mul nuw nsw i32 %conv6, %conv4
> > 
> > %add = add nuw nsw i32 %mul7, %mul
> > 
> > ret i32 %add
> > 
> > 
> > 
> > We generate the following assembly:
> > mul w8, w0, w1
> > 
> > mul w9, w2, w3
> > 
> > add w0, w9, w8
> > 
> > ret
> > 
> > 
> > 
> > Whereas I expected the MUL+ADD to be combined to MADD otherwise I
> > see
> > degraded performance in several of my tests.
> > 
> > Could someone please explain why we use two different APIs to
> > compute
> > depth and latency for the two instructions?
> > 
> > Also if I try to use the same APIs for both then the depth for the
> > NewRoot is 0.
> > 
> > 
> > 
> > Thanks,
> > 
> > Mandeep
> > _______________________________________________
> > LLVM Developers mailing list
> > LLVMdev at cs.uiuc.edu         http://llvm.cs.uiuc.edu
> > http://lists.cs.uiuc.edu/mailman/listinfo/llvmdev
> > 
> 
> --
> Hal Finkel
> Assistant Computational Scientist
> Leadership Computing Facility
> Argonne National Laboratory
> _______________________________________________
> LLVM Developers mailing list
> LLVMdev at cs.uiuc.edu         http://llvm.cs.uiuc.edu
> http://lists.cs.uiuc.edu/mailman/listinfo/llvmdev
> 
> 

-- 
Hal Finkel
Assistant Computational Scientist
Leadership Computing Facility
Argonne National Laboratory