[LLVMbugs] [Bug 2645] New: Vector splat within loop not optimized

Wed Aug 6 06:07:07 PDT 2008

http://llvm.org/bugs/show_bug.cgi?id=2645

           Summary: Vector splat within loop not optimized
           Product: new-bugs
           Version: unspecified
          Platform: PC
        OS/Version: Windows NT
            Status: NEW
          Severity: enhancement
          Priority: P2
         Component: new bugs
        AssignedTo: unassignedbugs at nondot.org
        ReportedBy: nicolas at capens.net
                CC: llvmbugs at cs.uiuc.edu

The following LLVM IR does not generate optimal x86 code for the vector splat:

define internal void @""(i8*) {
; <label>:1
        bitcast i8* %0 to i32*          ; <i32*>:2 [#uses=1]
        load i32* %2, align 1           ; <i32>:3 [#uses=1]
        getelementptr i8* %0, i32 4             ; <i8*>:4 [#uses=1]
        bitcast i8* %4 to i32*          ; <i32*>:5 [#uses=1]
        load i32* %5, align 1           ; <i32>:6 [#uses=1]
        br label %7

; <label>:7             ; preds = %9, %1
        %.01 = phi <4 x float> [ undef, %1 ], [ %12, %9 ]               ; <<4 x float>> [#uses=1]
        %.0 = phi i32 [ %3, %1 ], [ %15, %9 ]           ; <i32> [#uses=3]
        icmp slt i32 %.0, %6            ; <i1>:8 [#uses=1]
        br i1 %8, label %9, label %16

; <label>:9             ; preds = %7
        sitofp i32 %.0 to float         ; <float>:10 [#uses=1]
        insertelement <4 x float> %.01, float %10, i32 0                ; <<4 x float>>:11 [#uses=1]
        shufflevector <4 x float> %11, <4 x float> undef, <4 x i32> zeroinitializer             ; <<4 x float>>:12 [#uses=2]
        getelementptr i8* %0, i32 48            ; <i8*>:13 [#uses=1]
        bitcast i8* %13 to <4 x float>*         ; <<4 x float>*>:14 [#uses=1]
        store <4 x float> %12, <4 x float>* %14, align 16
        add i32 %.0, 2          ; <i32>:15 [#uses=1]
        br label %7

; <label>:16            ; preds = %7
        ret void
}

I'm seeing the following output:

0715D060  push        ebp  
0715D061  mov         ebp,esp 
0715D063  and         esp,0FFFFFFF0h 
0715D069  mov         eax,dword ptr [ebp+8] 
0715D06C  mov         ecx,dword ptr [eax+4] 
0715D06F  mov         edx,dword ptr [eax] 
0715D071  cmp         edx,ecx 
0715D073  jge         0715D092 
0715D079  cvtsi2ss    xmm1,edx 
0715D07D  movss       xmm0,xmm1 
0715D081  pshufd      xmm0,xmm0,0 
0715D086  movaps      xmmword ptr [eax+30h],xmm0 
0715D08A  add         edx,2 
0715D08D  jmp         0715D071 
0715D092  mov         esp,ebp 
0715D094  pop         ebp  
0715D095  ret    

Note the unnecessary movss inside the loop. The code generator appears to
assume that the upper elements of xmm0 will be used later on, even though the
pshufd that follows overwrites them with element 0. I noticed that on a CPU
with SSE4 support an insertps is used instead of a movss, which is even less
desirable (two extra instruction bytes).
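
To illustrate why the copy is redundant, here is a minimal C sketch that models
the two vector operations in the loop body with plain scalar code. The vec4f
type and the function names are hypothetical, introduced only for this
illustration; they are not LLVM APIs, and this is not how the code generator
represents the operations. The point is only that the result never depends on
the previous vector's lanes:

```c
#include <stddef.h>

/* Hypothetical 4-lane float vector standing in for <4 x float>. */
typedef struct { float lane[4]; } vec4f;

/* Models insertelement: replace one lane and keep the others. */
static vec4f insert_element(vec4f v, float x, size_t idx) {
    v.lane[idx] = x;
    return v;
}

/* Models shufflevector with a zeroinitializer mask:
   every result lane reads lane 0 of the first operand. */
static vec4f splat_lane0(vec4f v) {
    vec4f r;
    for (size_t i = 0; i < 4; ++i)
        r.lane[i] = v.lane[0];
    return r;
}

/* Models the vector part of the loop body: sitofp, insert into
   lane 0 of the previous value, then splat lane 0. */
static vec4f splat_from_int(vec4f previous, int n) {
    return splat_lane0(insert_element(previous, (float)n, 0));
}
```

Whatever is passed as previous, splat_from_int returns four copies of
(float)n, so preserving the old upper lanes before the shuffle buys nothing.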

Without the loop, it gets optimized to the following:

0719D3A0  push        ebp  
0719D3A1  mov         ebp,esp 
0719D3A3  and         esp,0FFFFFFF0h 
0719D3A9  mov         eax,dword ptr [ebp+8] 
0719D3AC  cvtsi2ss    xmm0,dword ptr [eax] 
0719D3B0  pshufd      xmm0,xmm0,0 
0719D3B5  movaps      xmmword ptr [eax+30h],xmm0 
0719D3B9  mov         esp,ebp 
0719D3BB  pop         ebp  
0719D3BC  ret     

Here there's no unnecessary movss and it saves a register. Using an if
statement instead of a loop also results in an optimized splat. So the logic
for optimizing the splat is clearly in place; it just isn't applied in the
presence of a loop.

I looked in DAGCombiner.cpp and X86ISelLowering.cpp, as both contain some
shuffle/splat related optimizations, but haven't located the discrepancy yet.
