[LLVMdev] Codegen performance issue: LEA vs. INC.

Wed Oct 2 08:38:56 PDT 2013

This sounds like llvm.org/pr13320.

On 17 September 2013 18:20, Bader, Aleksey A <aleksey.a.bader at intel.com> wrote:
> Hi all.
>
>
>
> I’m looking for advice on how to deal with inefficient code generation for
> the Intel Nehalem/Westmere architecture on a 64-bit platform, using the
> attached test.cpp (LLVM IR is in test.cpp.ll).
>
> The inner loop has 11 iterations and is eventually unrolled.
>
> test.lea.s is the assembly code of the outer loop. It simply has 11 loads,
> 11 FP adds, 11 FP muls, 1 FP store, lea+mov for the index computation, a cmp
> and a jump.
>
> The problem is that the lea is on the critical path because it is dispatched
> on the same port as all the FP add operations (port 1).
>
> Intel Architecture Code Analyzer (IACA) reports a throughput of 12.95 cycles
> for that assembly block.
>
> I did a short investigation and found that there is a codegen pass that
> replaces the index increment with lea.
>
> Here is the snippet from llvm/lib/CodeGen/TwoAddressInstructionPass.cpp:
>
>
>
> if (MI.isConvertibleTo3Addr()) {
>   // This instruction is potentially convertible to a true
>   // three-address instruction.  Check if it is profitable.
>   if (!regBKilled || isProfitableToConv3Addr(regA, regB)) {
>     // Try to convert it.
>     if (convertInstTo3Addr(mi, nmi, regA, regB, Dist)) {
>       ++NumConvertedTo3Addr;
>       return true; // Done with this instruction.
>     }
>   }
> }
>
>
>
> regBKilled is false for my test case and isProfitableToConv3Addr is not even
> called.
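
To make the short-circuit in that guard concrete, here is a minimal sketch. It is not the LLVM code: HeuristicProbe, originalGuard, experimentalGuard and demoGuards are hypothetical names, and the probe only records whether the profitability check was consulted. It does not model what isProfitableToConv3Addr actually computes.

```cpp
#include <cassert>

// Hypothetical stand-in for the pass's profitability heuristic. It only
// counts how often it is consulted and returns a fixed answer; the real
// isProfitableToConv3Addr(regA, regB) logic is not modelled here.
struct HeuristicProbe {
  bool answer = false; // pretend the heuristic says "not profitable"
  int calls = 0;
  bool isProfitable() {
    ++calls;
    return answer;
  }
};

// Same shape as the guard quoted from TwoAddressInstructionPass.cpp:
// when regBKilled is false, || short-circuits and the heuristic is skipped.
bool originalGuard(bool regBKilled, HeuristicProbe &probe) {
  return !regBKilled || probe.isProfitable();
}

// Variant that always consults the heuristic, regardless of regBKilled.
bool experimentalGuard(HeuristicProbe &probe) {
  return probe.isProfitable();
}

// Small usage example for the two guards.
void demoGuards() {
  HeuristicProbe probe;
  assert(originalGuard(/*regBKilled=*/false, probe)); // conversion attempted
  assert(probe.calls == 0);                           // heuristic never asked
  assert(!experimentalGuard(probe));                  // heuristic can veto
  assert(probe.calls == 1);
}
```

With regBKilled false, the original guard attempts the conversion without ever asking the heuristic, which is the behaviour described above.
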
>
> As an experiment, I left only the profitability check:
>
>
>
> if (isProfitableToConv3Addr(regA, regB)) {
>
>
>
> That gave me test.inc.s, where lea is replaced with inc+mov, and this code
> is ~27% faster on my Westmere system. IACA throughput analysis gives 11
> cycles for the new block.
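
The port-pressure argument can be sketched as a rough lower bound: the loop cannot run faster than the number of uops bound to its busiest port. This is only a back-of-the-envelope model, not what IACA computes. The port numbers for FP mul (0), loads (2) and inc (5) are assumptions that are not stated in the mail; only "FP add and lea on port 1" comes from the text above. The store, mov, cmp and jump are ignored, and all function names are hypothetical.

```cpp
#include <algorithm>
#include <array>

// Uops bound to each of six execution ports (index = port number).
using PortLoad = std::array<int, 6>;

// Lower bound on cycles per iteration: the busiest port needs one cycle
// per uop, so the loop cannot be faster than that port's uop count.
int throughputLowerBound(const PortLoad &uopsPerPort) {
  return *std::max_element(uopsPerPort.begin(), uopsPerPort.end());
}

// Loop body as in test.lea.s: lea shares port 1 with the 11 FP adds.
PortLoad loopWithLea() {
  PortLoad p{};  // all ports start at zero
  p[0] = 11;     // 11 FP muls (assumed port 0)
  p[1] = 11 + 1; // 11 FP adds + lea (port 1, per the mail)
  p[2] = 11;     // 11 loads (assumed port 2)
  return p;
}

// Loop body as in test.inc.s: inc is assumed to issue on another port.
PortLoad loopWithInc() {
  PortLoad p{};
  p[0] = 11; // 11 FP muls (assumed port 0)
  p[1] = 11; // 11 FP adds only
  p[2] = 11; // 11 loads (assumed port 2)
  p[5] = 1;  // inc (assumed port 5)
  return p;
}
```

Under these assumptions the bound is 12 cycles with lea and 11 with inc, which points the same way as the IACA figures of 12.95 and 11 without reproducing them exactly.
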
>
>
>
> But the best performance I got came from switching the scheduling algorithm
> from ILP to BURR (test.burr.s). It gives a few percent more vs. “ILP+INC”
> and I’m not sure why: it might be because test.burr.s has fewer instructions
> (no two moves that copy the index), or it might be because the additions are
> scheduled differently. BURR puts loads and FP muls between the additions,
> whereas ILP gathers the additions at the end of the loop.
>
>
>
> I didn’t run experiments on Sandy Bridge, but IACA gives 12.45 cycles for
> the original code (test.lea.s), so I expect BURR to improve performance
> there too for the attached test case.
>
>
>
> Unfortunately I’m not familiar enough with the LLVM codegen code to make a
> good fix for this issue, and I would appreciate any help.
>
>
>
> Thanks,
>
> Aleksey