[llvm-commits] [PATCH]IDIV->DIVB Atom Optimization

Tue Aug 14 14:01:33 PDT 2012

+// This file contains the runtime instruction selection optimization for signed
+// and unsigned integer division which selects 8-bit unsigned divide over a
+// 32-bit signed or unsigned divided when the dividend and divisor are positive
+// and less than 256. This is useful for architectures with slow dividers such
+// as some models of Intel Atom.

Can you make it more generic? Perhaps lower 64-bit div to 32-bit or 32 to 16, etc.? For that matter, we should make the target hook more flexible.

+  if (TLI && TLI->isSmallerDivIsCheaper()) {

Rather than isSmallerDivIsCheaper(), perhaps a hook that take a type and return a smaller type to lower the div to?

+bool BypassSlowDivision(Function &F,
+                     Function::iterator &I,
+                     DivCacheTy &DivCache)

The first character should be lower case to match the new LLVM naming convention.

Evan

On Aug 13, 2012, at 7:35 AM, "Nowicki, Tyler" <tyler.nowicki at intel.com> wrote:

> My apologies, that was not an SVN patch. Here is the correct patch.
> 
> Tyler
> 
>> -----Original Message-----
>> From: llvm-commits-bounces at cs.uiuc.edu [mailto:llvm-commits-
>> bounces at cs.uiuc.edu] On Behalf Of Nowicki, Tyler
>> Sent: Monday, August 13, 2012 10:21 AM
>> To: Evan Cheng
>> Cc: llvm-commits at cs.uiuc.edu
>> Subject: Re: [llvm-commits] [PATCH]IDIV->DIVB Atom Optimization
>> 
>> Sorry for the delayed response. This sounds like the right place. I've attached
>> a patch for a CodeGenPrep pass that bypasses slow division operations when
>> the target indicates that smaller divisions are cheaper.
>> 
>> Thanks,
>> 
>> Tyler
>> 
>>> -----Original Message-----
>>> From: Evan Cheng [mailto:evan.cheng at apple.com]
>>> Sent: Sunday, August 05, 2012 7:01 PM
>>> To: Nowicki, Tyler
>>> Cc: llvm-commits at cs.uiuc.edu
>>> Subject: Re: [llvm-commits] [PATCH]IDIV->DIVB Atom Optimization
>>> 
>>> Another possibility would be to make this transformation generic. It
>>> might make sense to put it in CodeGenPrep under control of some target
>> hook.
>>> CodeGenPrep is the place where IR is being modified before it goes
>>> through codegen and it is allowed to do have some target specific
>> knowledge.
>>> 
>>> Evan
>>> 
>>> On Aug 5, 2012, at 11:10 AM, Evan Cheng wrote:
>>> 
>>>> 
>>>> On Aug 3, 2012, at 3:30 PM, Nowicki, Tyler wrote:
>>>> 
>>>>> Let me clarify how this optimization works. Instead of
>>>>> 
>>>>> result = dividend/divisor;
>>>>> 
>>>>> an application programmer looking for optimal performance,
>>>>> targeting
>>> Atom CPUs, without this optimization would have to write:
>>>>> 
>>>>> if (((dividend | divisor) & 0xFF) < 0) { result = (char)dividend /
>>>>> (char)divisor; //fast divide } else { result = dividend / divisor;
>>>>> //slow divide }
>>>> 
>>>> Yes I understand. You want a simple division to be expanded into
>>>> code that
>>> involves multiple BB and control flows.
>>>> 
>>>>> 
>>>>> I still want the both the fast and slow divide to go through normal
>>>>> isel. But
>>> what you are suggesting significantly changes the way DIVs are handled
>>> for Atom CPUs, specifically avoiding the UDIVREM/SDIVREM expansion!
>>> Are you suggesting that I duplicate this code in a custom inserter function?
>>>> 
>>>> No, you would custom lower sdiv / udiv / srem / urem nodes so
>>>> UDIVREM /
>>> SDIVREM code will never be formed.  The custom inserter function
>>> create MachineInstr, not SDNodes.
>>>> 
>>>> Evan
>>>> 
>>>>> 
>>>>> Tyler
>>>>> 
>>>>>> -----Original Message-----
>>>>>> From: Evan Cheng [mailto:evan.cheng at apple.com]
>>>>>> Sent: Friday, August 03, 2012 6:07 PM
>>>>>> To: Nowicki, Tyler
>>>>>> Cc: llvm-commits at cs.uiuc.edu
>>>>>> Subject: Re: [llvm-commits] [PATCH]IDIV->DIVB Atom Optimization
>>>>>> 
>>>>>> 
>>>>>> On Aug 3, 2012, at 2:28 PM, Nowicki, Tyler wrote:
>>>>>> 
>>>>>>> This optimization should not be classified as `instruction
>>>>>>> selection'. We're not
>>>>>> simply replacing one form of divide with another at compilation
>>>>>> time. The optimization adds control flow to use an 8-bit DIV when
>>>>>> the values of the operands are determined to be small enough
>>>>>> during
>>> program execution.
>>>>>> 
>>>>>> It is instruction selection. This is about selecting the optimal
>>>>>> code sequence to implement an operation.
>>>>>> 
>>>>>>> 
>>>>>>> Several approaches were looked at for this optimization and the
>>>>>>> IR pass was
>>>>>> found to be the best way to accomplish it. Here are the approaches
>>>>>> that we looked at and the reasons they are not suitable for this
>>> optimization.
>>>>>>> 
>>>>>>> 1. Using a custom lowering function Specifying a custom lowering
>>>>>>> function for ISD::UDIV/UREM/etc instructions
>>>>>> using setOperationAction(...) in X86ISelLowering. However, the
>>>>>> lowering functions are limited to using a SelectionDAG. Because
>>>>>> the optimization requires several new BBs to be added it is not
>>>>>> possible to add it here. Is this the place you were thinking of?
>>>>>>> 
>>>>>>> 2. Using a custom inserter
>>>>>>> Marking appropriate instructions with usesCustomInserter in
>>>>>>> tablegen and
>>>>>> implementing the optimization by inserting several new
>>>>>> MachineBasicBlocks. As said previously ISD::UDIVREM/SDIVREM are
>>>>>> expanded into multiple instructions, see X86ISelDAGToDAG:2301.
>>>>>> Adding it here would require the inserted instructions to be
>>>>>> identified, removed, and a similar set of instructions to be added
>>>>>> for the fast and slow DIV cases. However, this cannot be easily
>>>>>> done because there are no guarantees how UDIVREM/SDIVREM will be
>>> expanded
>>>>>> or in what order, and inserting the fast and slow DIVs here would
>>> essentially require duplicating the UDIVREM/SDIVREM expansion code.
>>>>>> 
>>>>>> You want to use combination of 1 and 2. Custom lower into a target
>>>>>> specific node which will map to a pseudo instruction marked with
>>> usesCustomInserter.
>>>>>> That allows you to create new BBs. There are lots of examples, e.g.
>>> atomic ops.
>>>>>> 
>>>>>> Evan
>>>>>> 
>>>>>>> 
>>>>>>> 3. Other approaches
>>>>>>> We want the UDIVREM/SDIVREM expansion to operate on the
>> original
>>>>>>> DIV
>>>>>> instruction and the DIV this optimization inserts so it seems like
>>>>>> this optimization could be implemented in
>>>>>> X86DAGToDAGISel::PreprocessISelDAG. Can we add control flow at
>>>>>> this
>>> point?
>>>>>>> 
>>>>>>> Tyler Nowicki
>>>>>>> Intel
>>>>>>> 
>>>>>>>> -----Original Message-----
>>>>>>>> From: Evan Cheng [mailto:evan.cheng at apple.com]
>>>>>>>> Sent: Wednesday, August 01, 2012 4:38 PM
>>>>>>>> To: Nowicki, Tyler
>>>>>>>> Cc: llvm-commits at cs.uiuc.edu
>>>>>>>> Subject: Re: [llvm-commits] [PATCH]IDIV->DIVB Atom Optimization
>>>>>>>> 
>>>>>>>> 
>>>>>>>> On Aug 1, 2012, at 11:58 AM, "Nowicki, Tyler"
>>>>>>>> <tyler.nowicki at intel.com>
>>>>>>>> wrote:
>>>>>>>> 
>>>>>>>>> I don't understand what you are suggesting.
>>>>>>>>> 
>>>>>>>>> ISD::SDIV, ISD::SREM and ISD::UDIV, ISD::UREM are combined to
>>>>>>>>> form
>>>>>>>> ISD::SDIVREM, and ISD::UDIVREM. These are handled by
>>>>>>>> X86ISelDAGToDAG:2301 which takes care of all the special cases
>>>>>>>> by inserting the appropriate movs, cdq/cwd, and X86::DIV or
>>>>>>>> X86::IDIV instructions. A custom inserter is invoked after this,
>>>>>>>> but it's too late because the div/rem operations have been
>>>>>>>> replaced by numerous
>>>>>> instructions.
>>>>>>>> 
>>>>>>>> Looks like you don't understand how LLVM instruction works.
>>>>>>>> Please spend some time looking at the passes in isel, especially
>>>>>>>> how targets do custom lowering. Basically you want to lower
>>>>>>>> these ISD nodes before they reach X86ISelDAGToDAG.
>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> What is wrong with doing this in a backend IR pass?
>>>>>>>> 
>>>>>>>> Because this is a target specific instruction selection issue
>>>>>>>> and it doesn't belong in LLVM IR level. We don't want arbitrary
>>>>>>>> IR passes to do bits and pieces of optimization.
>>>>>>>> 
>>>>>>>> Evan
>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> Tyler
>>>>>>>>> 
>>>>>>>>>> -----Original Message-----
>>>>>>>>>> From: Evan Cheng [mailto:evan.cheng at apple.com]
>>>>>>>>>> Sent: Wednesday, August 01, 2012 2:18 PM
>>>>>>>>>> To: Nowicki, Tyler
>>>>>>>>>> Cc: llvm-commits at cs.uiuc.edu
>>>>>>>>>> Subject: Re: [llvm-commits] [PATCH]IDIV->DIVB Atom
>>>>>>>>>> Optimization
>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>>> On Aug 1, 2012, at 9:18 AM, "Nowicki, Tyler"
>>>>>>>>>> <tyler.nowicki at intel.com>
>>>>>>>>>> wrote:
>>>>>>>>>> 
>>>>>>>>>>> Thanks for the reply.
>>>>>>>>>>> 
>>>>>>>>>>> Lower is too late because DIV gets expanded into 2 MOV, a CDQ
>>>>>>>>>>> or CWD, the DIV or IDIV, and one or two more MOVs during
>>>>>>>>>>> instruction selection on DAG conversion in X86ISelDAGToDAG.
>>>>>>>>>>> Adding
>>>>>>>>>> 
>>>>>>>>>> Not during X86ISelDAGToDAG. You should do this using custom
>>>>>>>>>> lowering in X86ISelLowering. At that time, it just looks like
>>>>>>>>>> a ISD::SDIV /
>>>>>> UDIV node.
>>>>>>>>>> 
>>>>>>>>>>> this optimization after would require a custom inserter to
>>>>>>>>>>> search and remove
>>>>>>>>>> these instructions, and generate new BBs. This is complicated
>>>>>>>>>> by
>>>>>>>>>> O2 optimization which may remove MOVs when possible, by the
>>>>>>>>>> scheduler on Atom which changes the order of instructions to
>>>>>>>>>> avoid stalls, and by which form of the DIV instruction is used.
>>>>>>>>>> Also, depending on the spills and neighboring instructions it
>>>>>>>>>> is difficult to enumerate all the possible ways this
>>>>>>>>>> optimization can be
>>> carried out.
>>>>>>>>>> 
>>>>>>>>>> The right thing to do is to custom lower into a X86ISD target
>>>>>>>>>> node and select to a pseudo instruction. It can be marked with
>>>>>>>>>> usesCustomInserter = 1 which allows you to expand it into the
>>>>>>>>>> machine
>>>>>>>> instructions and create new BBs if needed.
>>>>>>>>>> 
>>>>>>>>>>> 
>>>>>>>>>>> It makes sense to do the optimization as an IR pass because
>>>>>>>>>>> none of this has
>>>>>>>>>> happened yet. Then CSE and scheduling can operate on the new
>>> BBs.
>>>>>>>>>> 
>>>>>>>>>> Scheduling doesn't happen until later and machine CSE can
>>>>>>>>>> handle these instructions. Please look at some examples to
>>>>>>>>>> learn more about how existing targets handle complex cases.
>>>>>>>>>> 
>>>>>>>>>> Evan
>>>>>>>>>> 
>>>>>>>>>>> 
>>>>>>>>>>> Thanks,
>>>>>>>>>>> 
>>>>>>>>>>> Tyler
>>>>>>>>>>> 
>>>>>>>>>>>> -----Original Message-----
>>>>>>>>>>>> From: Evan Cheng [mailto:evan.cheng at apple.com]
>>>>>>>>>>>> Sent: Tuesday, July 31, 2012 6:41 PM
>>>>>>>>>>>> To: Nowicki, Tyler
>>>>>>>>>>>> Cc: Eli Friedman; llvm-commits at cs.uiuc.edu
>>>>>>>>>>>> Subject: Re: [llvm-commits] [PATCH]IDIV->DIVB Atom
>>>>>>>>>>>> Optimization
>>>>>>>>>>>> 
>>>>>>>>>>>> Hi Tyler,
>>>>>>>>>>>> 
>>>>>>>>>>>> This doesn't look like the right approach. I don't
>>>>>>>>>>>> understand why you are adding a LLVM ir pass to do the
>>>>>>>>>>>> expansion. Why can't this be done at x86 instruction lowering
>> time?
>>>>>>>>>>>> 
>>>>>>>>>>>> Evan
>>>>>>>>>>>> 
>>>>>>>>>>>> On Jul 31, 2012, at 1:25 PM, "Nowicki, Tyler"
>>>>>>>>>>>> <tyler.nowicki at intel.com>
>>>>>>>>>>>> wrote:
>>>>>>>>>>>> 
>>>>>>>>>>>>> Thanks for the review. Good catch on the missing check,
>>>>>>>>>>>>> I've also made use
>>>>>>>>>>>> of DenseMap as suggested. Test cases omitted from the
>>>>>>>>>>>> previously patch are in place again and a test case for the
>>>>>>>>>>>> missing check you found has been added.
>>>>>>>>>>>>> 
>>>>>>>>>>>>> Please review my corrections.
>>>>>>>>>>>>> 
>>>>>>>>>>>>> Tyler Nowicki
>>>>>>>>>>>>> Intel
>>>>>>>>>>>>> 
>>>>>>>>>>>>>> -----Original Message-----
>>>>>>>>>>>>>> From: Eli Friedman [mailto:eli.friedman at gmail.com]
>>>>>>>>>>>>>> Sent: Friday, July 27, 2012 2:17 PM
>>>>>>>>>>>>>> To: Nowicki, Tyler
>>>>>>>>>>>>>> Cc: Sean Silva; llvm-commits at cs.uiuc.edu
>>>>>>>>>>>>>> Subject: Re: [llvm-commits] [PATCH]IDIV->DIVB Atom
>>>>>>>>>>>>>> Optimization
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> On Fri, Jul 27, 2012 at 8:35 AM, Nowicki, Tyler
>>>>>>>>>>>>>> <tyler.nowicki at intel.com>
>>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>>> This patch has not yet been reviewed. Could someone take
>>>>>>>>>>>>>>> a look at
>>>>>>>> it?
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> +    typedef std::pair<Value *, Value *> DivBOperands;
>>>>>>>>>>>>>> + //Dividend
>>>>>>>>>>>>>> Value, Divisor Value
>>>>>>>>>>>>>> +    typedef std::pair<PHINode *, PHINode *> DivBPhiNodes;
>>>>>>>>>>>>>> + //Quotient
>>>>>>>>>>>>>> PHI, Remainder Phi
>>>>>>>>>>>>>> +    typedef std::map<DivBOperands, DivBPhiNodes>
>>>>>>>>>>>>>> + DivBCacheTy;
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> llvm::DenseMap?
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> +  // Replace operation value with previously generated
>>>>>>>>>>>>>> + phi node DivBPhiNodes Value = CacheI->second;  if
>> (UseDivOp) {
>>>>>>>>>>>>>> +    // Replace all uses of div instruction with quotient phi
>> node
>>>>>>>>>>>>>> +    J->replaceAllUsesWith(Value.first);
>>>>>>>>>>>>>> +  } else {
>>>>>>>>>>>>>> +    // Replace all uses of rem instruction with remainder
>>>>>>>>>>>>>> + phi
>>> node
>>>>>>>>>>>>>> +    J->replaceAllUsesWith(Value.second);
>>>>>>>>>>>>>> +  }
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> Looks like there's a missing check for UseSignedOp here.
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> The patch doesn't appear to include any testcases.
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> -Eli
>>>>>>>>>>>>> <idivtodivb-
>>>>>>>>>>>> 
>>>>>> svn.patch>_______________________________________________
>>>>>>>>>>>>> llvm-commits mailing list
>>>>>>>>>>>>> llvm-commits at cs.uiuc.edu
>>>>>>>>>>>>> http://lists.cs.uiuc.edu/mailman/listinfo/llvm-commits
>>>>>>>>>>> 
>>>>>>>>> 
>>>>>>> 
>>>>> 
>>>> 
>>>> _______________________________________________
>>>> llvm-commits mailing list
>>>> llvm-commits at cs.uiuc.edu
>>>> http://lists.cs.uiuc.edu/mailman/listinfo/llvm-commits
> 
> <BypassSlowDiv5-svn.patch>