[LLVMbugs] [Bug 2094] New: Inefficient code generated for inline asm with multiple in-out register operands
bugzilla-daemon at cs.uiuc.edu
Mon Feb 25 21:17:09 PST 2008
http://llvm.org/bugs/show_bug.cgi?id=2094
Summary: Inefficient code generated for inline asm with multiple
in-out register operands
Product: new-bugs
Version: unspecified
Platform: PC
OS/Version: Linux
Status: NEW
Severity: normal
Priority: P2
Component: new bugs
AssignedTo: unassignedbugs at nondot.org
ReportedBy: sharparrow1 at yahoo.com
CC: llvmbugs at cs.uiuc.edu
Testcase:
#include <stdint.h>
int sad16_sse2(void *v, uint8_t *blk2, uint8_t *blk1, int stride, int h)
{
    int ret;
    asm volatile(
        "pxor %%xmm6, %%xmm6 \n\t"
        //ASMALIGN(4)
        "1: \n\t"
        "movdqu (%1), %%xmm0 \n\t"
        "movdqu (%1, %3), %%xmm1 \n\t"
        "psadbw (%2), %%xmm0 \n\t"
        "psadbw (%2, %3), %%xmm1 \n\t"
        "paddw %%xmm0, %%xmm6 \n\t"
        "paddw %%xmm1, %%xmm6 \n\t"
        "lea (%1,%3,2), %1 \n\t"
        "lea (%2,%3,2), %2 \n\t"
        "sub $2, %0 \n\t"
        " jg 1b \n\t"
        : "+r" (h), "+r" (blk1), "+r" (blk2)
        : "r" ((long)stride)
    );
    asm volatile(
        "movhlps %%xmm6, %%xmm0 \n\t"
        "paddw %%xmm0, %%xmm6 \n\t"
        "movd %%xmm6, %0 \n\t"
        : "=r"(ret)
    );
    return ret;
}
Generated code:
pushl %esi
subl $8, %esp
movl 20(%esp), %edx
movl %edx, 4(%esp)
movl 24(%esp), %ecx
movl %ecx, (%esp)
movl 32(%esp), %eax
movl 28(%esp), %esi
#APP
pxor %xmm6, %xmm6
1:
movdqu (%ecx), %xmm0
movdqu (%ecx, %esi), %xmm1
psadbw (%edx), %xmm0
psadbw (%edx, %esi), %xmm1
paddw %xmm0, %xmm6
paddw %xmm1, %xmm6
lea (%ecx,%esi,2), %ecx
lea (%edx,%esi,2), %edx
sub $2, %eax
jg 1b
#NO_APP
movl %edx, 4(%esp)
movl %ecx, (%esp)
#APP
movhlps %xmm6, %xmm0
paddw %xmm0, %xmm6
movd %xmm6, %eax
#NO_APP
addl $8, %esp
popl %esi
ret
(We'll put aside for the moment the fact that this code is extremely dangerous:
neither asm statement declares the xmm registers it uses, so a compiler using
certain kinds of optimizations might actually end up using the xmm regs between
the two asm statements and destroy the sum held in xmm6.)
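For reference, here is a minimal sketch of how the hazard could be avoided by
keeping the loop and the final reduction in one asm statement and listing the
clobbered registers. This is not part of the original testcase and has not
been checked against the codegen problem reported here. The names sad16_ref
and sad16_single_asm are made up for illustration, the unused v argument is
dropped, and the sketch assumes h is positive and even and that each row of
blk2 is 16-byte aligned (psadbw with a memory operand requires that). On
non-x86 targets it simply falls back to the plain C loop.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Portable reference: sum of absolute differences over h rows of 16 bytes. */
static int sad16_ref(const uint8_t *blk2, const uint8_t *blk1, int stride, int h)
{
    int sum = 0;
    for (int y = 0; y < h; y++) {
        for (int x = 0; x < 16; x++)
            sum += abs(blk1[x] - blk2[x]);
        blk1 += stride;
        blk2 += stride;
    }
    return sum;
}

/* Hypothetical single-statement variant of the testcase (illustrative name).
 * Assumes h > 0, h even, and 16-byte-aligned rows in blk2. */
static int sad16_single_asm(const uint8_t *blk2, const uint8_t *blk1, int stride, int h)
{
    assert(h > 0 && h % 2 == 0);
#if defined(__GNUC__) && defined(__SSE2__) && (defined(__x86_64__) || defined(__i386__))
    int ret;
    __asm__ volatile(
        "pxor %%xmm6, %%xmm6 \n\t"
        "1: \n\t"
        "movdqu (%[blk1]), %%xmm0 \n\t"
        "movdqu (%[blk1], %[stride]), %%xmm1 \n\t"
        "psadbw (%[blk2]), %%xmm0 \n\t"
        "psadbw (%[blk2], %[stride]), %%xmm1 \n\t"
        "paddw %%xmm0, %%xmm6 \n\t"
        "paddw %%xmm1, %%xmm6 \n\t"
        "lea (%[blk1],%[stride],2), %[blk1] \n\t"
        "lea (%[blk2],%[stride],2), %[blk2] \n\t"
        "sub $2, %[h] \n\t"
        "jg 1b \n\t"
        /* Reduction stays in the same statement, so the compiler gets no
         * chance to touch xmm6 between the loop and the final movd. */
        "movhlps %%xmm6, %%xmm0 \n\t"
        "paddw %%xmm0, %%xmm6 \n\t"
        "movd %%xmm6, %[ret] \n\t"
        : [ret] "=r" (ret),
          /* Early-clobber (&): these are written before stride is last read. */
          [h] "+&r" (h), [blk1] "+&r" (blk1), [blk2] "+&r" (blk2)
        : [stride] "r" ((long)stride)
        : "xmm0", "xmm1", "xmm6", "cc", "memory"
    );
    return ret;
#else
    return sad16_ref(blk2, blk1, stride, h);
#endif
}
```

The clobber list tells the compiler which xmm registers and flags are
overwritten, and "memory" covers the reads through blk1 and blk2. The three
in-out operands are kept, so this variant would presumably still exercise the
multiple in-out operand path this report is about.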
The generated code is inefficient: it emits four unnecessary stores to the
stack (two before the first #APP block and two after its #NO_APP, none of which
is ever read back), plus the subl/addl needed to allocate space for them. I
think it's because blk1 and blk2 have to be put into allocas at the IR level,
and codegen isn't smart enough to eliminate them. Not sure what the right fix
is; maybe inline asm should take advantage of the multiple return value work?
(I don't know how much fixing this will help, but this function shows up at the
top of a profile in ffmpeg re-encoding from H.264 to MPEG-4, so every bit likely
helps.)
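A smaller testcase along the same lines might look like the sketch below: two
"+r" pointer operands plus one plain input, with no SSE involved. The function
name lea_twice is made up for illustration, and whether this reduced form
triggers the same stack stores has not been verified; it only isolates the
"multiple in-out register operands" pattern from the summary. On non-x86
targets it falls back to plain pointer arithmetic.

```c
#include <stdint.h>

/* Hypothetical reduced testcase: advance two pointers by 2*step inside one
 * asm statement, then read one byte through each. Assumes both advanced
 * pointers still point into valid memory. */
static int lea_twice(const uint8_t *blk1, const uint8_t *blk2, long step)
{
#if defined(__GNUC__) && (defined(__x86_64__) || defined(__i386__))
    __asm__ volatile(
        "lea (%0,%2,2), %0 \n\t"
        "lea (%1,%2,2), %1 \n\t"
        /* "+r" mirrors the constraints used in the report. */
        : "+r" (blk1), "+r" (blk2)
        : "r" (step)
    );
#else
    blk1 += 2 * step;
    blk2 += 2 * step;
#endif
    return blk1[0] + blk2[0];
}
```

If the allocas are the cause, the interesting thing to look for in the output
would be blk1 and blk2 being spilled around the #APP block even though both
stay in registers the whole time.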