[llvm-commits] [PATCH]IDIV->DIVB Atom Optimization

Wed Aug 22 11:46:32 PDT 2012

When you get a chance would you please review the changes.

Thanks,

Tyler

> -----Original Message-----
> From: llvm-commits-bounces at cs.uiuc.edu [mailto:llvm-commits-
> bounces at cs.uiuc.edu] On Behalf Of Nowicki, Tyler
> Sent: Wednesday, August 15, 2012 3:03 PM
> To: Evan Cheng
> Cc: llvm-commits at cs.uiuc.edu
> Subject: Re: [llvm-commits] [PATCH]IDIV->DIVB Atom Optimization
> 
> Thanks for the review.
> 
> I made the changes you suggested, here is the latest patch and commit
> message.
> 
> Generic Bypass Slow Div
> -	CodeGenPrepare pass for identifying div/rem ops
> -	Backend specifies the type mapping using addBypassSlowDivType
> -	Enabled only on Intel Atom with O2 for 32-bit -> 8-bit
> -	Replace IDIV with instructions which test its value and use DIVB if the
> value is positive and less than 256.
> -	In the case when the quotient and remainder of a divide are used a
> DIV and a REM instruction will be present in the IR. In the non-Atom case
> they are both lowered to IDIVs and CSE removes the redundant IDIV
> instruction, using the quotient and remainder from the first IDIV. However,
> due to this optimization CSE is not able to eliminate redundant IDIV
> instructions because they are located in different basic blocks. This is
> overcome by calculating both the quotient (DIV) and remainder (REM) in
> each basic block that is inserted by the optimization and reusing the result
> values when a subsequent DIV or REM instruction uses the same operands.
> -	Test cases check for the presence of the optimization when
> calculating either the quotient, remainder,  or both.
> 
> Tyler
> 
> > -----Original Message-----
> > From: Evan Cheng [mailto:evan.cheng at apple.com]
> > Sent: Tuesday, August 14, 2012 5:02 PM
> > To: Nowicki, Tyler
> > Cc: llvm-commits at cs.uiuc.edu
> > Subject: Re: [llvm-commits] [PATCH]IDIV->DIVB Atom Optimization
> >
> > +// This file contains the runtime instruction selection optimization
> > +for signed // and unsigned integer division which selects 8-bit
> > +unsigned divide over a // 32-bit signed or unsigned divided when the
> > +dividend and divisor are positive // and less than 256. This is
> > +useful for architectures with slow dividers such // as some models of Intel
> Atom.
> >
> > Can you make it more generic? Perhaps lower 64-bit div to 32-bit or 32
> > to 16, etc.? For that matter, we should make the target hook more flexible.
> >
> > +  if (TLI && TLI->isSmallerDivIsCheaper()) {
> >
> > Rather than isSmallerDivIsCheaper(), perhaps a hook that take a type
> > and return a smaller type to lower the div to?
> >
> > +bool BypassSlowDivision(Function &F,
> > +                     Function::iterator &I,
> > +                     DivCacheTy &DivCache)
> >
> > The first character should be lower case to match the new LLVM naming
> > convention.
> >
> > Evan
> >
> > On Aug 13, 2012, at 7:35 AM, "Nowicki, Tyler"
> > <tyler.nowicki at intel.com>
> > wrote:
> >
> > > My apologies, that was not an SVN patch. Here is the correct patch.
> > >
> > > Tyler
> > >
> > >> -----Original Message-----
> > >> From: llvm-commits-bounces at cs.uiuc.edu [mailto:llvm-commits-
> > >> bounces at cs.uiuc.edu] On Behalf Of Nowicki, Tyler
> > >> Sent: Monday, August 13, 2012 10:21 AM
> > >> To: Evan Cheng
> > >> Cc: llvm-commits at cs.uiuc.edu
> > >> Subject: Re: [llvm-commits] [PATCH]IDIV->DIVB Atom Optimization
> > >>
> > >> Sorry for the delayed response. This sounds like the right place.
> > >> I've attached a patch for a CodeGenPrep pass that bypasses slow
> > >> division operations when the target indicates that smaller
> > >> divisions are
> > cheaper.
> > >>
> > >> Thanks,
> > >>
> > >> Tyler
> > >>
> > >>> -----Original Message-----
> > >>> From: Evan Cheng [mailto:evan.cheng at apple.com]
> > >>> Sent: Sunday, August 05, 2012 7:01 PM
> > >>> To: Nowicki, Tyler
> > >>> Cc: llvm-commits at cs.uiuc.edu
> > >>> Subject: Re: [llvm-commits] [PATCH]IDIV->DIVB Atom Optimization
> > >>>
> > >>> Another possibility would be to make this transformation generic.
> > >>> It might make sense to put it in CodeGenPrep under control of some
> > >>> target
> > >> hook.
> > >>> CodeGenPrep is the place where IR is being modified before it goes
> > >>> through codegen and it is allowed to do have some target specific
> > >> knowledge.
> > >>>
> > >>> Evan
> > >>>
> > >>> On Aug 5, 2012, at 11:10 AM, Evan Cheng wrote:
> > >>>
> > >>>>
> > >>>> On Aug 3, 2012, at 3:30 PM, Nowicki, Tyler wrote:
> > >>>>
> > >>>>> Let me clarify how this optimization works. Instead of
> > >>>>>
> > >>>>> result = dividend/divisor;
> > >>>>>
> > >>>>> an application programmer looking for optimal performance,
> > >>>>> targeting
> > >>> Atom CPUs, without this optimization would have to write:
> > >>>>>
> > >>>>> if (((dividend | divisor) & 0xFF) < 0) { result = (char)dividend
> > >>>>> / (char)divisor; //fast divide } else { result = dividend /
> > >>>>> divisor; //slow divide }
> > >>>>
> > >>>> Yes I understand. You want a simple division to be expanded into
> > >>>> code that
> > >>> involves multiple BB and control flows.
> > >>>>
> > >>>>>
> > >>>>> I still want the both the fast and slow divide to go through
> > >>>>> normal isel. But
> > >>> what you are suggesting significantly changes the way DIVs are
> > >>> handled for Atom CPUs, specifically avoiding the UDIVREM/SDIVREM
> > expansion!
> > >>> Are you suggesting that I duplicate this code in a custom inserter
> > function?
> > >>>>
> > >>>> No, you would custom lower sdiv / udiv / srem / urem nodes so
> > >>>> UDIVREM /
> > >>> SDIVREM code will never be formed.  The custom inserter function
> > >>> create MachineInstr, not SDNodes.
> > >>>>
> > >>>> Evan
> > >>>>
> > >>>>>
> > >>>>> Tyler
> > >>>>>
> > >>>>>> -----Original Message-----
> > >>>>>> From: Evan Cheng [mailto:evan.cheng at apple.com]
> > >>>>>> Sent: Friday, August 03, 2012 6:07 PM
> > >>>>>> To: Nowicki, Tyler
> > >>>>>> Cc: llvm-commits at cs.uiuc.edu
> > >>>>>> Subject: Re: [llvm-commits] [PATCH]IDIV->DIVB Atom Optimization
> > >>>>>>
> > >>>>>>
> > >>>>>> On Aug 3, 2012, at 2:28 PM, Nowicki, Tyler wrote:
> > >>>>>>
> > >>>>>>> This optimization should not be classified as `instruction
> > >>>>>>> selection'. We're not
> > >>>>>> simply replacing one form of divide with another at compilation
> > >>>>>> time. The optimization adds control flow to use an 8-bit DIV
> > >>>>>> when the values of the operands are determined to be small
> > >>>>>> enough during
> > >>> program execution.
> > >>>>>>
> > >>>>>> It is instruction selection. This is about selecting the
> > >>>>>> optimal code sequence to implement an operation.
> > >>>>>>
> > >>>>>>>
> > >>>>>>> Several approaches were looked at for this optimization and
> > >>>>>>> the IR pass was
> > >>>>>> found to be the best way to accomplish it. Here are the
> > >>>>>> approaches that we looked at and the reasons they are not
> > >>>>>> suitable for this
> > >>> optimization.
> > >>>>>>>
> > >>>>>>> 1. Using a custom lowering function Specifying a custom
> > >>>>>>> lowering function for ISD::UDIV/UREM/etc instructions
> > >>>>>> using setOperationAction(...) in X86ISelLowering. However, the
> > >>>>>> lowering functions are limited to using a SelectionDAG. Because
> > >>>>>> the optimization requires several new BBs to be added it is not
> > >>>>>> possible to add it here. Is this the place you were thinking of?
> > >>>>>>>
> > >>>>>>> 2. Using a custom inserter
> > >>>>>>> Marking appropriate instructions with usesCustomInserter in
> > >>>>>>> tablegen and
> > >>>>>> implementing the optimization by inserting several new
> > >>>>>> MachineBasicBlocks. As said previously ISD::UDIVREM/SDIVREM
> are
> > >>>>>> expanded into multiple instructions, see X86ISelDAGToDAG:2301.
> > >>>>>> Adding it here would require the inserted instructions to be
> > >>>>>> identified, removed, and a similar set of instructions to be
> > >>>>>> added for the fast and slow DIV cases. However, this cannot be
> > >>>>>> easily done because there are no guarantees how
> > UDIVREM/SDIVREM
> > >>>>>> will be
> > >>> expanded
> > >>>>>> or in what order, and inserting the fast and slow DIVs here
> > >>>>>> would
> > >>> essentially require duplicating the UDIVREM/SDIVREM expansion code.
> > >>>>>>
> > >>>>>> You want to use combination of 1 and 2. Custom lower into a
> > >>>>>> target specific node which will map to a pseudo instruction
> > >>>>>> marked with
> > >>> usesCustomInserter.
> > >>>>>> That allows you to create new BBs. There are lots of examples, e.g.
> > >>> atomic ops.
> > >>>>>>
> > >>>>>> Evan
> > >>>>>>
> > >>>>>>>
> > >>>>>>> 3. Other approaches
> > >>>>>>> We want the UDIVREM/SDIVREM expansion to operate on the
> > >> original
> > >>>>>>> DIV
> > >>>>>> instruction and the DIV this optimization inserts so it seems
> > >>>>>> like this optimization could be implemented in
> > >>>>>> X86DAGToDAGISel::PreprocessISelDAG. Can we add control flow at
> > >>>>>> this
> > >>> point?
> > >>>>>>>
> > >>>>>>> Tyler Nowicki
> > >>>>>>> Intel
> > >>>>>>>
> > >>>>>>>> -----Original Message-----
> > >>>>>>>> From: Evan Cheng [mailto:evan.cheng at apple.com]
> > >>>>>>>> Sent: Wednesday, August 01, 2012 4:38 PM
> > >>>>>>>> To: Nowicki, Tyler
> > >>>>>>>> Cc: llvm-commits at cs.uiuc.edu
> > >>>>>>>> Subject: Re: [llvm-commits] [PATCH]IDIV->DIVB Atom
> > Optimization
> > >>>>>>>>
> > >>>>>>>>
> > >>>>>>>> On Aug 1, 2012, at 11:58 AM, "Nowicki, Tyler"
> > >>>>>>>> <tyler.nowicki at intel.com>
> > >>>>>>>> wrote:
> > >>>>>>>>
> > >>>>>>>>> I don't understand what you are suggesting.
> > >>>>>>>>>
> > >>>>>>>>> ISD::SDIV, ISD::SREM and ISD::UDIV, ISD::UREM are combined
> > >>>>>>>>> to form
> > >>>>>>>> ISD::SDIVREM, and ISD::UDIVREM. These are handled by
> > >>>>>>>> X86ISelDAGToDAG:2301 which takes care of all the special
> > >>>>>>>> cases by inserting the appropriate movs, cdq/cwd, and
> > >>>>>>>> X86::DIV or X86::IDIV instructions. A custom inserter is
> > >>>>>>>> invoked after this, but it's too late because the div/rem
> > >>>>>>>> operations have been replaced by numerous
> > >>>>>> instructions.
> > >>>>>>>>
> > >>>>>>>> Looks like you don't understand how LLVM instruction works.
> > >>>>>>>> Please spend some time looking at the passes in isel,
> > >>>>>>>> especially how targets do custom lowering. Basically you want
> > >>>>>>>> to lower these ISD nodes before they reach X86ISelDAGToDAG.
> > >>>>>>>>
> > >>>>>>>>>
> > >>>>>>>>> What is wrong with doing this in a backend IR pass?
> > >>>>>>>>
> > >>>>>>>> Because this is a target specific instruction selection issue
> > >>>>>>>> and it doesn't belong in LLVM IR level. We don't want
> > >>>>>>>> arbitrary IR passes to do bits and pieces of optimization.
> > >>>>>>>>
> > >>>>>>>> Evan
> > >>>>>>>>
> > >>>>>>>>>
> > >>>>>>>>> Tyler
> > >>>>>>>>>
> > >>>>>>>>>> -----Original Message-----
> > >>>>>>>>>> From: Evan Cheng [mailto:evan.cheng at apple.com]
> > >>>>>>>>>> Sent: Wednesday, August 01, 2012 2:18 PM
> > >>>>>>>>>> To: Nowicki, Tyler
> > >>>>>>>>>> Cc: llvm-commits at cs.uiuc.edu
> > >>>>>>>>>> Subject: Re: [llvm-commits] [PATCH]IDIV->DIVB Atom
> > >>>>>>>>>> Optimization
> > >>>>>>>>>>
> > >>>>>>>>>>
> > >>>>>>>>>> On Aug 1, 2012, at 9:18 AM, "Nowicki, Tyler"
> > >>>>>>>>>> <tyler.nowicki at intel.com>
> > >>>>>>>>>> wrote:
> > >>>>>>>>>>
> > >>>>>>>>>>> Thanks for the reply.
> > >>>>>>>>>>>
> > >>>>>>>>>>> Lower is too late because DIV gets expanded into 2 MOV, a
> > >>>>>>>>>>> CDQ or CWD, the DIV or IDIV, and one or two more MOVs
> > during
> > >>>>>>>>>>> instruction selection on DAG conversion in
> X86ISelDAGToDAG.
> > >>>>>>>>>>> Adding
> > >>>>>>>>>>
> > >>>>>>>>>> Not during X86ISelDAGToDAG. You should do this using
> custom
> > >>>>>>>>>> lowering in X86ISelLowering. At that time, it just looks
> > >>>>>>>>>> like a ISD::SDIV /
> > >>>>>> UDIV node.
> > >>>>>>>>>>
> > >>>>>>>>>>> this optimization after would require a custom inserter to
> > >>>>>>>>>>> search and remove
> > >>>>>>>>>> these instructions, and generate new BBs. This is
> > >>>>>>>>>> complicated by
> > >>>>>>>>>> O2 optimization which may remove MOVs when possible, by
> the
> > >>>>>>>>>> scheduler on Atom which changes the order of instructions
> > >>>>>>>>>> to avoid stalls, and by which form of the DIV instruction is
> used.
> > >>>>>>>>>> Also, depending on the spills and neighboring instructions
> > >>>>>>>>>> it is difficult to enumerate all the possible ways this
> > >>>>>>>>>> optimization can be
> > >>> carried out.
> > >>>>>>>>>>
> > >>>>>>>>>> The right thing to do is to custom lower into a X86ISD
> > >>>>>>>>>> target node and select to a pseudo instruction. It can be
> > >>>>>>>>>> marked with usesCustomInserter = 1 which allows you to
> > >>>>>>>>>> expand it into the machine
> > >>>>>>>> instructions and create new BBs if needed.
> > >>>>>>>>>>
> > >>>>>>>>>>>
> > >>>>>>>>>>> It makes sense to do the optimization as an IR pass
> > >>>>>>>>>>> because none of this has
> > >>>>>>>>>> happened yet. Then CSE and scheduling can operate on the
> > >>>>>>>>>> new
> > >>> BBs.
> > >>>>>>>>>>
> > >>>>>>>>>> Scheduling doesn't happen until later and machine CSE can
> > >>>>>>>>>> handle these instructions. Please look at some examples to
> > >>>>>>>>>> learn more about how existing targets handle complex cases.
> > >>>>>>>>>>
> > >>>>>>>>>> Evan
> > >>>>>>>>>>
> > >>>>>>>>>>>
> > >>>>>>>>>>> Thanks,
> > >>>>>>>>>>>
> > >>>>>>>>>>> Tyler
> > >>>>>>>>>>>
> > >>>>>>>>>>>> -----Original Message-----
> > >>>>>>>>>>>> From: Evan Cheng [mailto:evan.cheng at apple.com]
> > >>>>>>>>>>>> Sent: Tuesday, July 31, 2012 6:41 PM
> > >>>>>>>>>>>> To: Nowicki, Tyler
> > >>>>>>>>>>>> Cc: Eli Friedman; llvm-commits at cs.uiuc.edu
> > >>>>>>>>>>>> Subject: Re: [llvm-commits] [PATCH]IDIV->DIVB Atom
> > >>>>>>>>>>>> Optimization
> > >>>>>>>>>>>>
> > >>>>>>>>>>>> Hi Tyler,
> > >>>>>>>>>>>>
> > >>>>>>>>>>>> This doesn't look like the right approach. I don't
> > >>>>>>>>>>>> understand why you are adding a LLVM ir pass to do the
> > >>>>>>>>>>>> expansion. Why can't this be done at x86 instruction
> > >>>>>>>>>>>> lowering
> > >> time?
> > >>>>>>>>>>>>
> > >>>>>>>>>>>> Evan
> > >>>>>>>>>>>>
> > >>>>>>>>>>>> On Jul 31, 2012, at 1:25 PM, "Nowicki, Tyler"
> > >>>>>>>>>>>> <tyler.nowicki at intel.com>
> > >>>>>>>>>>>> wrote:
> > >>>>>>>>>>>>
> > >>>>>>>>>>>>> Thanks for the review. Good catch on the missing check,
> > >>>>>>>>>>>>> I've also made use
> > >>>>>>>>>>>> of DenseMap as suggested. Test cases omitted from the
> > >>>>>>>>>>>> previously patch are in place again and a test case for
> > >>>>>>>>>>>> the missing check you found has been added.
> > >>>>>>>>>>>>>
> > >>>>>>>>>>>>> Please review my corrections.
> > >>>>>>>>>>>>>
> > >>>>>>>>>>>>> Tyler Nowicki
> > >>>>>>>>>>>>> Intel
> > >>>>>>>>>>>>>
> > >>>>>>>>>>>>>> -----Original Message-----
> > >>>>>>>>>>>>>> From: Eli Friedman [mailto:eli.friedman at gmail.com]
> > >>>>>>>>>>>>>> Sent: Friday, July 27, 2012 2:17 PM
> > >>>>>>>>>>>>>> To: Nowicki, Tyler
> > >>>>>>>>>>>>>> Cc: Sean Silva; llvm-commits at cs.uiuc.edu
> > >>>>>>>>>>>>>> Subject: Re: [llvm-commits] [PATCH]IDIV->DIVB Atom
> > >>>>>>>>>>>>>> Optimization
> > >>>>>>>>>>>>>>
> > >>>>>>>>>>>>>> On Fri, Jul 27, 2012 at 8:35 AM, Nowicki, Tyler
> > >>>>>>>>>>>>>> <tyler.nowicki at intel.com>
> > >>>>>>>>>>>>>> wrote:
> > >>>>>>>>>>>>>>> This patch has not yet been reviewed. Could someone
> > take
> > >>>>>>>>>>>>>>> a look at
> > >>>>>>>> it?
> > >>>>>>>>>>>>>>
> > >>>>>>>>>>>>>> +    typedef std::pair<Value *, Value *> DivBOperands;
> > >>>>>>>>>>>>>> + //Dividend
> > >>>>>>>>>>>>>> Value, Divisor Value
> > >>>>>>>>>>>>>> +    typedef std::pair<PHINode *, PHINode *>
> > >>>>>>>>>>>>>> + DivBPhiNodes; //Quotient
> > >>>>>>>>>>>>>> PHI, Remainder Phi
> > >>>>>>>>>>>>>> +    typedef std::map<DivBOperands, DivBPhiNodes>
> > >>>>>>>>>>>>>> + DivBCacheTy;
> > >>>>>>>>>>>>>>
> > >>>>>>>>>>>>>> llvm::DenseMap?
> > >>>>>>>>>>>>>>
> > >>>>>>>>>>>>>> +  // Replace operation value with previously generated
> > >>>>>>>>>>>>>> + phi node DivBPhiNodes Value = CacheI->second;  if
> > >> (UseDivOp) {
> > >>>>>>>>>>>>>> +    // Replace all uses of div instruction with
> > >>>>>>>>>>>>>> + quotient phi
> > >> node
> > >>>>>>>>>>>>>> +    J->replaceAllUsesWith(Value.first);
> > >>>>>>>>>>>>>> +  } else {
> > >>>>>>>>>>>>>> +    // Replace all uses of rem instruction with
> > >>>>>>>>>>>>>> + remainder phi
> > >>> node
> > >>>>>>>>>>>>>> +    J->replaceAllUsesWith(Value.second);
> > >>>>>>>>>>>>>> +  }
> > >>>>>>>>>>>>>>
> > >>>>>>>>>>>>>> Looks like there's a missing check for UseSignedOp here.
> > >>>>>>>>>>>>>>
> > >>>>>>>>>>>>>> The patch doesn't appear to include any testcases.
> > >>>>>>>>>>>>>>
> > >>>>>>>>>>>>>> -Eli
> > >>>>>>>>>>>>> <idivtodivb-
> > >>>>>>>>>>>>
> > >>>>>>
> svn.patch>_______________________________________________
> > >>>>>>>>>>>>> llvm-commits mailing list llvm-commits at cs.uiuc.edu
> > >>>>>>>>>>>>> http://lists.cs.uiuc.edu/mailman/listinfo/llvm-commits
> > >>>>>>>>>>>
> > >>>>>>>>>
> > >>>>>>>
> > >>>>>
> > >>>>
> > >>>> _______________________________________________
> > >>>> llvm-commits mailing list
> > >>>> llvm-commits at cs.uiuc.edu
> > >>>> http://lists.cs.uiuc.edu/mailman/listinfo/llvm-commits
> > >
> > > <BypassSlowDiv5-svn.patch>