[LLVMbugs] [Bug 2094] New: Inefficient code generated for inline asm with multiple in-out register operands

Mon Feb 25 21:17:09 PST 2008

http://llvm.org/bugs/show_bug.cgi?id=2094

           Summary: Inefficient code generated for inline asm with multiple
                    in-out register operands
           Product: new-bugs
           Version: unspecified
          Platform: PC
        OS/Version: Linux
            Status: NEW
          Severity: normal
          Priority: P2
         Component: new bugs
        AssignedTo: unassignedbugs at nondot.org
        ReportedBy: sharparrow1 at yahoo.com
                CC: llvmbugs at cs.uiuc.edu

Testcase:
#include <stdint.h>

int sad16_sse2(void *v, uint8_t *blk2, uint8_t *blk1, int stride, int h)
{
    int ret;
    asm volatile(
        "pxor %%xmm6, %%xmm6            \n\t"
        //ASMALIGN(4)
        "1:                             \n\t"
        "movdqu (%1), %%xmm0            \n\t"
        "movdqu (%1, %3), %%xmm1        \n\t"
        "psadbw (%2), %%xmm0            \n\t"
        "psadbw (%2, %3), %%xmm1        \n\t"
        "paddw %%xmm0, %%xmm6           \n\t"
        "paddw %%xmm1, %%xmm6           \n\t"
        "lea (%1,%3,2), %1              \n\t"
        "lea (%2,%3,2), %2              \n\t"
        "sub $2, %0                     \n\t"
        " jg 1b                         \n\t"
        : "+r" (h), "+r" (blk1), "+r" (blk2)
        : "r" ((long)stride)
    );
    asm volatile(
        "movhlps %%xmm6, %%xmm0         \n\t"
        "paddw   %%xmm0, %%xmm6         \n\t"
        "movd    %%xmm6, %0             \n\t"
        : "=r"(ret)
    );
    return ret;
}

Generated code:

        pushl   %esi
        subl    $8, %esp
        movl    20(%esp), %edx
        movl    %edx, 4(%esp)
        movl    24(%esp), %ecx
        movl    %ecx, (%esp)
        movl    32(%esp), %eax
        movl    28(%esp), %esi
        #APP
        pxor %xmm6, %xmm6            
        1:                             
        movdqu (%ecx), %xmm0            
        movdqu (%ecx, %esi), %xmm1        
        psadbw (%edx), %xmm0            
        psadbw (%edx, %esi), %xmm1        
        paddw %xmm0, %xmm6           
        paddw %xmm1, %xmm6           
        lea (%ecx,%esi,2), %ecx              
        lea (%edx,%esi,2), %edx              
        sub $2, %eax                     
         jg 1b                         

        #NO_APP
        movl    %edx, 4(%esp)
        movl    %ecx, (%esp)
        #APP
        movhlps %xmm6, %xmm0         
        paddw   %xmm0, %xmm6         
        movd    %xmm6, %eax             

        #NO_APP
        addl    $8, %esp
        popl    %esi
        ret

(We'll put aside for the moment the fact that this code is extremely dangerous:
neither asm statement declares the xmm registers it touches, so a compiler
using certain kinds of optimizations might actually end up using the xmm regs
between the two asm statements.)
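
For reference, one way to sidestep the xmm hazard mentioned above is to merge
the two asm statements and declare the clobbers. This is only a sketch and not
part of the testcase: the names sad16_sse2_merged and sad16_ref are made up
here, the unused first parameter is dropped, and stride is passed as a long.
It assumes blk2 is 16-byte aligned (psadbw takes it as a memory operand), that
h is even and positive, and that h is small enough (16 or less) for the 16-bit
sums not to overflow. Without SSE2 it falls back to the plain C loop.

```c
#include <stdint.h>
#include <stdlib.h>

/* Plain C reference: sum of absolute differences over h rows of 16 bytes. */
static int sad16_ref(const uint8_t *blk2, const uint8_t *blk1, long stride, int h)
{
    int sum = 0;
    for (int y = 0; y < h; y++)
        for (int x = 0; x < 16; x++)
            sum += abs(blk1[y * stride + x] - blk2[y * stride + x]);
    return sum;
}

/* Hypothetical merged variant: a single asm statement, so xmm6 never has to
 * survive between two statements, and every register the asm touches is
 * listed as a clobber. Operands: %0 = ret, %1 = h, %2 = blk1, %3 = blk2,
 * %4 = stride. */
static int sad16_sse2_merged(uint8_t *blk2, uint8_t *blk1, long stride, int h)
{
#if defined(__SSE2__) && defined(__GNUC__)
    int ret;
    __asm__ volatile(
        "pxor %%xmm6, %%xmm6            \n\t"
        "1:                             \n\t"
        "movdqu (%2), %%xmm0            \n\t"
        "movdqu (%2, %4), %%xmm1        \n\t"
        "psadbw (%3), %%xmm0            \n\t"
        "psadbw (%3, %4), %%xmm1        \n\t"
        "paddw %%xmm0, %%xmm6           \n\t"
        "paddw %%xmm1, %%xmm6           \n\t"
        "lea (%2,%4,2), %2              \n\t"
        "lea (%3,%4,2), %3              \n\t"
        "sub $2, %1                     \n\t"
        "jg 1b                          \n\t"
        "movhlps %%xmm6, %%xmm0         \n\t"
        "paddw   %%xmm0, %%xmm6         \n\t"
        "movd    %%xmm6, %0             \n\t"
        : "=r"(ret), "+r"(h), "+r"(blk1), "+r"(blk2)
        : "r"(stride)
        : "xmm0", "xmm1", "xmm6", "memory", "cc"
    );
    return ret;
#else
    /* No SSE2 available: use the portable loop instead. */
    return sad16_ref(blk2, blk1, stride, h);
#endif
}
```

This still has three in-out register operands, so it does not change the
subject of this report; it only removes the dependence on xmm6 surviving
between two statements.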

The generated code is rather inefficient: it emits four unnecessary stores to
the stack, plus the stack adjustment needed to make room for them. I think
it's because blk1 and blk2 have to be put into allocas at the IR level, and
codegen isn't smart enough to eliminate them.  I'm not sure what the right fix
is; maybe inline asm should take advantage of the multiple return value work?
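
To make the "multiple return value" idea a bit more concrete: a "+r" operand
can also be spelled as a separate "=r" output tied to an input with a matching
constraint ("0", "1", ...), in which case the asm reads three values and
produces three new ones instead of modifying variables in place. The sketch
below uses an empty asm template so it builds with GCC or Clang on any target;
struct loop_state, step_inout and step_tied are hypothetical names. Whether
the tied form generates better code here is not something I have checked.

```c
/* The three values the loop in the testcase both reads and writes. */
struct loop_state {
    int h;
    unsigned char *blk1;
    unsigned char *blk2;
};

/* In-out form, as in the testcase: each operand is read and written in place. */
static struct loop_state step_inout(int h, unsigned char *blk1, unsigned char *blk2)
{
    __asm__ volatile("" : "+r"(h), "+r"(blk1), "+r"(blk2));
    struct loop_state s = { h, blk1, blk2 };
    return s;
}

/* Tied form: three plain outputs, each forced into the same register as the
 * corresponding input by a matching constraint. The inputs are never
 * modified; the asm simply yields three results. */
static struct loop_state step_tied(int h, unsigned char *blk1, unsigned char *blk2)
{
    struct loop_state s;
    __asm__ volatile(""
        : "=r"(s.h), "=r"(s.blk1), "=r"(s.blk2)
        : "0"(h), "1"(blk1), "2"(blk2));
    return s;
}
```

With an empty template both functions just hand their arguments back, since
each output shares a register with its input.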

(I don't know how much fixing this will help, but this function shows up at the
top of a profile of ffmpeg re-encoding from H.264 to MPEG-4, so every bit
likely helps.)
