[LLVMdev] Tight overlapping loops and performance

Mon Mar 2 16:58:16 PST 2009

> You're misreading the asm... nothing is touching memory. (BTW, "leal
> -1(%eax), %eax" isn't a memory operation; it's just subtracting one
> from %eax.) You might want to try reading the LLVM IR (which you can
> generate with llvm-gcc -S -emit-llvm); it tends to be easier to read.

I tried that, but I'm still learning LLVM. Seeing indvar, phi nodes, tail
calls on printfs, and nounwinds had me more confused than the asm.

> A taken and non-taken branch have roughly the same cost on any
> remotely recent x86 processor.

I was wondering if that might be the case.

The crux of the example still seems intact. Here is the output from LLVM SVN, converted to asm via llc:

        .text
        .align  4,0x90
        .globl  _main
_main:
        subl    $12, %esp
        movl    $1999, %eax
        xorl    %ecx, %ecx
        movl    $1999, %edx
        .align  4,0x90
LBB1_1: ## loopto
        cmpl    $1, %eax
        leal    -1(%eax), %eax
        cmove   %edx, %eax
        incl    %ecx
        cmpl    $999999999, %ecx
        jne     LBB1_1  ## loopto
LBB1_2: ## bb1
        movl    %eax, 4(%esp)
        movl    $LC, (%esp)
        call    _printf
        xorl    %eax, %eax
        addl    $12, %esp
        ret
        .section __TEXT,__cstring,cstring_literals
LC:                             ## LC
        .asciz  "Timeout: %i\n"

        .subsections_via_symbols

Rewriting the loops by hand to count down with decl instead of using cmove/incl might seem like more work (it adds a second conditional branch), but it appears to be faster:

        .text
        .align  4,0x90
        .globl  _main
_main:
        subl    $12, %esp
        movl    $2000, %eax
        movl    $1000000000, %ecx
        .align  4,0x90
LBB1_3:
        movl    $2000, %eax
LBB1_1: ## loopto
        decl    %eax
        jz      LBB1_3
        decl    %ecx
        jnz     LBB1_1  ## loopto
LBB1_2: ## bb1
        movl    %eax, 4(%esp)
        movl    $LC, (%esp)
        call    _printf
        xorl    %eax, %eax
        addl    $12, %esp
        ret
        .section __TEXT,__cstring,cstring_literals
LC:                             ## LC
        .asciz  "Timeout: %i\n"

        .subsections_via_symbols

The first example runs in 1.7s, the second in 1.0s. That's on my dual-core OS X box. I also have a 2-processor quad-core Xeon box running Linux, and it shows very similar results.
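
For anyone who finds the asm hard to follow, here is a rough C model of the two loops. It is only a sketch that mirrors the listings instruction by instruction; it is not the original C source. The function names loop_cmov and loop_dec and the iteration-count parameters are made up for illustration. It is meant for checking that both loops compute the same value, not for timing, since a compiler is free to turn either function into something quite different.

```c
#include <assert.h>
#include <stdint.h>

/* Models the first listing (cmpl / leal / cmove / incl / cmpl / jne).
 * "iterations" stands in for the constant 999999999. */
uint32_t loop_cmov(uint32_t iterations) {
    assert(iterations > 0);             /* the asm loop is do-while shaped */
    uint32_t eax = 1999, ecx = 0;
    const uint32_t edx = 1999;
    do {
        int was_one = (eax == 1);       /* cmpl  $1, %eax                  */
        eax = eax - 1;                  /* leal  -1(%eax), %eax (no flags) */
        if (was_one) eax = edx;         /* cmove %edx, %eax                */
        ecx++;                          /* incl  %ecx                      */
    } while (ecx != iterations);        /* cmpl  ..., %ecx ; jne LBB1_1    */
    return eax;
}

/* Models the second listing (decl / jz / decl / jnz).
 * "count" stands in for the constant 1000000000. */
uint32_t loop_dec(uint32_t count) {
    assert(count > 0);
    uint32_t eax, ecx = count;
    for (;;) {
        eax = 2000;                     /* LBB1_3: movl $2000, %eax        */
        for (;;) {
            eax--;                      /* decl %eax                       */
            if (eax == 0) break;        /* jz   LBB1_3 (skips decl %ecx)   */
            ecx--;                      /* decl %ecx                       */
            if (ecx == 0) return eax;   /* jnz  LBB1_1 falls through       */
        }
    }
}
```

In this model the reset pass in loop_dec does not decrement the outer counter, which is why the second listing starts from 2000 and 1000000000 rather than 1999 and 999999999. With those offsets, loop_cmov(n) and loop_dec(n + 1) return the same value, so the printf argument should match between the two versions.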

Jonathan
