[llvm-commits] [PATCH]IDIV->DIVB Atom Optimization

Fri Jul 20 13:09:26 PDT 2012

+  // Search for existing instance of Dividend and Divisor in DivBList
+  // and replace with previously generated phi nodes
+
+  for(std::list<DivBNode>::iterator iselI = DivBList.begin();
+      iselI != DivBList.end(); iselI++) {
+    if (iselI->Dividend == Dividend && iselI->Divisor == Divisor) {
+      // Replace operation value with previously generated phi node
+      if (UseDivOp) {
+        J->replaceAllUsesWith(iselI->DivPhi);
+      } else {
+        J->replaceAllUsesWith(iselI->RemPhi);
+      }
+
+      // Advance to next operation
+      J++;
+
+      // Remove redundant operation
+      Instr->eraseFromParent();
+
+      return;
+    }
+  }

Instead of doing a linear search here, use an appropriate set data
structure <http://llvm.org/docs/ProgrammersManual.html#ds_set>.
Otherwise, you run the risk of taking quadratic time, which will
usually result in someone filing a "this compilation never seems to
end" bug somewhere down the line.

Now that I see that this is a cache that is just trying to reuse
previous results, you should probably call `replaceOrInsertFastDiv`
something like `reuseOrInsertFastDiv`.

Now that I understand the structure of what this is doing a bit
better, you could probably make the behavior a bit clearer by
factoring out what is done on each basic block into a single static
function handleBasicBlock(F, I, DivBCache). You can then instantiate
DivBCache at the beginning of runOnFunction, and call clear after the
call to handleBasicBlock, like this:

bool X86LowerDiv::runOnFunction(Function &F) {
  DivBCacheTy PerBBDivBCache;
  for (Function::iterator I = F.begin(); I != F.end(); I++) {
    handleBasicBlock(F, I, PerBBDivBCache);
    PerBBDivBCache.clear();
  }
}

The benefit of this is that it increases locality in the source;
instead of having the cache declared all the way up in the class
declaration, you can see it being declared right where it is used.
Ideally, PerBBDivBCache would be instantiated as a local variable in
`handleBasicBlock`, but as I'm sure you're aware, there is a
nontrivial performance penalty to performing the allocation for the
map container every time through a loop.

Also mentioned in the LLVM coding standards
<http://llvm.org/docs/CodingStandards.html#use-early-exits-and-continue-to-simplify-code>,
use a continue in the loop over BB's, so instead of

if (Ty->isIntegerTy(32)) {
  /* indented stuff */
}

instead prefer

if (!J->getType()->isIntegerTy(32))
  continue;
/* stuff that is now not indented as much */

I also moved in the J->getType() instead of the local variable `Ty`
because the local variable was adding two lines of code, which
decreases locality when reading and is otherwise distracting (in this
case, at least; naming things can help a lot in a lot of cases, but
here the benefits seem marginal).

(btw, remember to choose "reply all" when replying, so that your
replies go to the list as well.)

--Sean Silva

On Fri, Jul 20, 2012 at 2:02 PM, Nowicki, Tyler <tyler.nowicki at intel.com> wrote:
> Thank you for your reply and for catching my deviations from the coding standard.
>
>  The suggested code you gave does not check for a div/rem, rather it only identifies the Boolean details. With you changes and the extra check it looks like this:
>
>       // Get instruction details
>       unsigned Opcode = J->getOpcode();
>       bool UseDivOp = Opcode == Instruction::SDiv || Opcode == Instruction::UDiv;
>       bool UseRemOp = Opcode == Instruction::SRem || Opcode == Instruction::URem;
>       bool UseSignedOp = Opcode == Instruction::SDiv || Opcode == Instruction::SRem;
>
>       if (UseDivOp || UseRemOp) {
>         replaceOrInsertFastDiv(F, I, J, UseDivOp, UseSignedOp);
>       }
>
> Sinking the test would require the '//Get instruction details' code to be added to both insertFastDiv and replaceOrInsertFastDiv, with some omissions for unused variables. If that sounds like the right idea to you then I'll make that change, otherwise I will pass the Boolean values.
>
> Would you suggest I duplicate the '//Get instruction details' code in each function that needs the variables?
>
> Tyler
>
> -----Original Message-----
> From: Sean Silva [mailto:silvas at purdue.edu]
> Sent: Thursday, July 19, 2012 8:15 PM
> To: Nowicki, Tyler
> Cc: llvm-commits at cs.uiuc.edu
> Subject: Re: [llvm-commits] [PATCH]IDIV->DIVB Atom Optimization
>
> +                                Function::iterator& I,
> +                                BasicBlock::iterator& J,
> +                                bool UseDivOp, bool UseSignedOp);
> +  private:
> +    struct DivBNode {
> +      PHINode* DivPhi;
> +      PHINode* RemPhi;
>
> LLVM coding style <http://llvm.org/docs/CodingStandards.html> puts the '*' or '&' on the right.
>
> +        if (J->getOpcode() == Instruction::SDiv) {
> +          replaceOrInsertFastDiv(F, I, J, true, true);
> +        } else if (J->getOpcode() == Instruction::UDiv) {
> +          replaceOrInsertFastDiv(F, I, J, true, false);
> +        } else if (J->getOpcode() == Instruction::SRem) {
> +          replaceOrInsertFastDiv(F, I, J, false, true);
> +        } else if (J->getOpcode() == Instruction::URem) {
> +          replaceOrInsertFastDiv(F, I, J, false, false);
> +        }
>
> This could probably be factored better to make it clearer. Maybe something like
>
> unsigned Opcode = J->getOpcode();
> bool UseDivOp = Opcode == Instruction::SDiv || Opcode == Instruction::UDiv; bool UseSignedOp = Opcode == Instruction::SDiv || Opcode == Instruction::SRem; replaceOrInsertFastDiv(F, I, J, UseDivOp, UseSignedOp);
>
> Or maybe since you're passing in J anyway, sink these tests into replaceOrInsertFastDiv.
>
> +void X86LowerDiv::insertFastDiv(Function &F,
> +                                  Function::iterator& I,
>
> This could be aligned better.
>
> --Sean Silva
>
>
> On Thu, Jul 19, 2012 at 4:05 PM, Nowicki, Tyler <tyler.nowicki at intel.com> wrote:
>> Hi,
>>
>> Here is an optimization for Intel Atom processors which uses a DIVB instruction rather than an IDIV when both the dividend and divisor are positive values less than 256. We've tested this with a number of benchmark suites and it yields a positive performance improvement due to the slowness of a 32-bit divide on Atom architectures.
>>
>> Commit message:
>> IDIV->DIVB optimization
>>   - Enabled only for Intel Atom with O2
>>   - Use DIVB instruction rather than IDIV when dividend and divisor are positive less than 256.
>>   - In the case when the quotient and remainder of a divide are used a DIV and a REM instruction will be present in the IR. In the non-Atom case they are both lowered to IDIVs and CSE removes the redundant IDIV instruction, using the quotient and remainder from the first IDIV. However, due to this optimization CSE is not able to eliminate redundant IDIV instructions because they are located in different basic blocks. This is overcome by calculating both the quotient (DIV) and remainder (REM) in each basic block that is inserted by the optimization and reusing the result values when a subsequent DIV or REM instruction uses the same operands.
>>   - Test cases check for the optimization when calculating a quotient, remainder, or both.
>>
>> Tyler Nowicki
>> Intel
>>
>> _______________________________________________
>> llvm-commits mailing list
>> llvm-commits at cs.uiuc.edu
>> http://lists.cs.uiuc.edu/mailman/listinfo/llvm-commits
>>