           Summary: extractps selected too eagerly
           Product: new-bugs
           Version: unspecified
          Platform: PC
        OS/Version: Windows NT
            Status: NEW
          Severity: enhancement
          Priority: P2
         Component: new bugs
        AssignedTo: unassignedbugs at nondot.org
        ReportedBy: nicolas at capens.net
                CC: llvmbugs at cs.uiuc.edu

The following LLVM IR compiles to suboptimal code on x86 CPUs with SSE4
support, but optimizes fine on older CPUs:

external global float, align 16         ; <float*>:0 [#uses=2]

define internal void @""() {
        load float* @0, align 16                ; <float>:1 [#uses=1]
        insertelement <4 x float> undef, float %1, i32 0                ; <<4 x float>>:2 [#uses=1]
        call <4 x float> @llvm.x86.sse.rsqrt.ss( <4 x float> %2 )              ; <<4 x float>>:3 [#uses=1]
        extractelement <4 x float> %3, i32 0            ; <float>:4 [#uses=1]
        store float %4, float* @0, align 16
        ret void
}

declare <4 x float> @llvm.x86.sse.rsqrt.ss(<4 x float>) nounwind readnone
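
For readers more at home with C intrinsics than with IR, the test case corresponds roughly to the sketch below. The function name rsqrt_in_place is made up for illustration, and the pointer parameter stands in for the unnamed global. One difference is an assumption worth flagging: _mm_load_ss zeroes the upper three lanes, whereas the IR leaves them undef. Whether a given compiler produces the same instruction sequence from this source is not something this report establishes.

```c
#include <xmmintrin.h>  /* SSE intrinsics: _mm_load_ss, _mm_rsqrt_ss, _mm_store_ss */

/* Hypothetical helper mirroring the IR above: load one float, apply RSQRTSS
 * to lane 0, and store lane 0 back to the same address. */
void rsqrt_in_place(float *p)
{
    /* load + insertelement: lane 0 = *p (upper lanes are zeroed here,
     * while the IR leaves them undef) */
    __m128 v = _mm_load_ss(p);

    /* call @llvm.x86.sse.rsqrt.ss: approximate 1/sqrt(x) in lane 0 only */
    __m128 r = _mm_rsqrt_ss(v);

    /* extractelement ..., i32 0 followed by store: write lane 0 back */
    _mm_store_ss(p, r);
}
```

Calling it on a float holding 4.0 should leave a value close to 0.5 in place, within the limited precision of the RSQRTSS approximation.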

Here's the result on a Penryn CPU:

  push        ebp  
  mov         ebp,esp 
  and         esp,0FFFFFFF0h 
  rsqrtss     xmm0,dword ptr ds:[1762ED0h] 
  extractps   eax, xmm0
  movd        xmm0,eax 
  movss       dword ptr ds:[1762ED0h],xmm0 
  mov         esp,ebp 
  pop         ebp  

And this is the lovable code I get on Conroe:

  rsqrtss     xmm0,dword ptr ds:[1762ED0h] 
  movss       dword ptr ds:[1762ED0h],xmm0 

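Ignoring the stack setup for now, it looks like extractps is selected too
eagerly for an extractelement v4f32, 0. Element 0 is already in the low lane
of xmm0, so the extractps/movd round trip through eax is redundant and movss
can store the result directly, as it does on Conroe.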

P.S.: To quickly test with and without SSE4 support, just force X86SSELevel to
the desired value in X86Subtarget::AutoDetectSubtargetFeatures().

