[LLVMbugs] [Bug 2645] New: Vector splat within loop not optimized
bugzilla-daemon at cs.uiuc.edu
Wed Aug 6 06:07:07 PDT 2008
http://llvm.org/bugs/show_bug.cgi?id=2645
Summary: Vector splat within loop not optimized
Product: new-bugs
Version: unspecified
Platform: PC
OS/Version: Windows NT
Status: NEW
Severity: enhancement
Priority: P2
Component: new bugs
AssignedTo: unassignedbugs at nondot.org
ReportedBy: nicolas at capens.net
CC: llvmbugs at cs.uiuc.edu
The following LLVM IR does not generate optimal x86 code for the vector splat:
define internal void @""(i8*) {
; <label>:1
bitcast i8* %0 to i32* ; <i32*>:2 [#uses=1]
load i32* %2, align 1 ; <i32>:3 [#uses=1]
getelementptr i8* %0, i32 4 ; <i8*>:4 [#uses=1]
bitcast i8* %4 to i32* ; <i32*>:5 [#uses=1]
load i32* %5, align 1 ; <i32>:6 [#uses=1]
br label %7
; <label>:7 ; preds = %9, %1
%.01 = phi <4 x float> [ undef, %1 ], [ %12, %9 ] ; <<4 x float>> [#uses=1]
%.0 = phi i32 [ %3, %1 ], [ %15, %9 ] ; <i32> [#uses=3]
icmp slt i32 %.0, %6 ; <i1>:8 [#uses=1]
br i1 %8, label %9, label %16
; <label>:9 ; preds = %7
sitofp i32 %.0 to float ; <float>:10 [#uses=1]
insertelement <4 x float> %.01, float %10, i32 0 ; <<4 x float>>:11 [#uses=1]
shufflevector <4 x float> %11, <4 x float> undef, <4 x i32> zeroinitializer ; <<4 x float>>:12 [#uses=2]
getelementptr i8* %0, i32 48 ; <i8*>:13 [#uses=1]
bitcast i8* %13 to <4 x float>* ; <<4 x float>*>:14 [#uses=1]
store <4 x float> %12, <4 x float>* %14, align 16
add i32 %.0, 2 ; <i32>:15 [#uses=1]
br label %7
; <label>:16 ; preds = %7
ret void
}
I'm seeing the following output:
0715D060 push ebp
0715D061 mov ebp,esp
0715D063 and esp,0FFFFFFF0h
0715D069 mov eax,dword ptr [ebp+8]
0715D06C mov ecx,dword ptr [eax+4]
0715D06F mov edx,dword ptr [eax]
0715D071 cmp edx,ecx
0715D073 jge 0715D092
0715D079 cvtsi2ss xmm1,edx
0715D07D movss xmm0,xmm1
0715D081 pshufd xmm0,xmm0,0
0715D086 movaps xmmword ptr [eax+30h],xmm0
0715D08A add edx,2
0715D08D jmp 0715D071
0715D092 mov esp,ebp
0715D094 pop ebp
0715D095 ret
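Note the unnecessary movss. It merges the converted value into the low element
of xmm0 as if the upper elements of xmm0 were going to be used later on, but
the pshufd that follows overwrites all four elements with element 0, so the
cvtsi2ss could have written to xmm0 directly. I noticed that on a CPU with
SSE4.1 support an insertps is used instead of a movss, which is even less
desirable (two extra instruction bytes).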
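To illustrate why the upper elements cannot matter, here is a minimal Python
model of the insertelement/shufflevector pair from the loop body. It is only a
sketch of the IR semantics, not LLVM code; the helper names and the UNDEF
marker are made up for this example.

```python
# Minimal model of the two vector operations in the loop body.
# Vectors are 4-element lists; UNDEF marks a lane with no defined value.
# All names here are hypothetical helpers, not LLVM APIs.
UNDEF = object()

def insertelement(vec, value, index):
    """Return a copy of vec with lane `index` replaced by value."""
    out = list(vec)
    out[index] = value
    return out

def shufflevector(a, b, mask):
    """Select lanes from the concatenation of a and b according to mask."""
    lanes = list(a) + list(b)
    return [lanes[i] for i in mask]

def splat_lane0(prev, scalar):
    """Mirror the loop body: insert into lane 0, then broadcast lane 0."""
    inserted = insertelement(prev, scalar, 0)
    # A zeroinitializer mask reads lane 0 of the first operand four times.
    return shufflevector(inserted, [UNDEF] * 4, [0, 0, 0, 0])

# The result is the same whatever the previous vector held.
from_undef = splat_lane0([UNDEF] * 4, 3.0)
from_data = splat_lane0([1.0, 2.0, 4.0, 8.0], 3.0)
print(from_undef == from_data)
```

Since the result does not depend on the incoming vector at all, the value
carried around the loop by the phi never needs to be preserved.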
Without the loop, it gets optimized to the following:
0719D3A0 push ebp
0719D3A1 mov ebp,esp
0719D3A3 and esp,0FFFFFFF0h
0719D3A9 mov eax,dword ptr [ebp+8]
0719D3AC cvtsi2ss xmm0,dword ptr [eax]
0719D3B0 pshufd xmm0,xmm0,0
0719D3B5 movaps xmmword ptr [eax+30h],xmm0
0719D3B9 mov esp,ebp
0719D3BB pop ebp
0719D3BC ret
Here there's no unnecessary movss and it saves a register. Using an if
statement instead of a loop also results in an optimized splat. So the logic
for optimizing the splat is clearly in place; it just isn't applied in the
presence of a loop.
I looked in DAGCombiner.cpp and X86ISelLowering.cpp, as both contain some
shuffle/splat related optimizations, but haven't located the discrepancy yet.
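The reasoning that would allow the phi value to be treated as dead can be
sketched as a small "which lanes are actually read" computation. The Python
below is purely an illustration under that assumption; it does not describe
what DAGCombiner.cpp or X86ISelLowering.cpp actually do, and the function
names are hypothetical.

```python
def demanded_by_shuffle(mask, width):
    """Lanes of the first shuffle operand that the mask reads.

    Mask indices >= width refer to the second operand and are ignored here.
    """
    return {i for i in mask if 0 <= i < width}

def demanded_through_insert(demanded, index):
    """Lanes of the base vector still needed once lane `index` is overwritten."""
    return demanded - {index}

WIDTH = 4
# The store needs every lane of the shuffle result, but the all-zero mask
# means the shuffle itself reads only lane 0 of the insertelement result.
needed_from_insert = demanded_by_shuffle([0, 0, 0, 0], WIDTH)
# insertelement overwrites lane 0, so nothing is needed from the phi.
needed_from_phi = demanded_through_insert(needed_from_insert, 0)
print(needed_from_insert, needed_from_phi)
```

An empty set for the phi means its incoming value could be replaced by undef,
which is the situation in the loop-free version that already gets optimized.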