[LLVMbugs] [Bug 2585] New: Unoptimal vector 'trunc' emulation (and crash)

Tue Jul 22 15:54:35 PDT 2008

http://llvm.org/bugs/show_bug.cgi?id=2585

           Summary: Unoptimal vector 'trunc' emulation (and crash)
           Product: new-bugs
           Version: unspecified
          Platform: PC
        OS/Version: All
            Status: NEW
          Severity: normal
          Priority: P2
         Component: new bugs
        AssignedTo: unassignedbugs at nondot.org
        ReportedBy: nicolas at capens.net
                CC: llvmbugs at cs.uiuc.edu

In an attempt to do an element-wise trunc of <4 x i32> to <4 x i16> I tried
using the following LLVM IR:

external constant <4 x i32>             ; <<4 x i32>*>:0 [#uses=1]
external constant <4 x i16>             ; <<4 x i16>*>:1 [#uses=1]

define internal void @""() {
        load <4 x i32>* @0, align 16            ; <<4 x i32>>:1 [#uses=1]
        bitcast <4 x i32> %1 to <8 x i16>               ; <<8 x i16>>:2 [#uses=1]
        shufflevector <8 x i16> %2, <8 x i16> undef, <8 x i32> < i32 0, i32 2, i32 4, i32 6, i32 undef, i32 undef, i32 undef, i32 undef >               ; <<8 x i16>>:3 [#uses=1]
        bitcast <8 x i16> %3 to <2 x i64>               ; <<2 x i64>>:4 [#uses=1]
        extractelement <2 x i64> %4, i32 0              ; <i64>:5 [#uses=1]
        bitcast i64 %5 to <4 x i16>             ; <<4 x i16>>:6 [#uses=1]
        store <4 x i16> %6, <4 x i16>* @1, align 8
        ret void
}
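
For reference, the intent of the IR above can be written as a small scalar
model in C. This is only a sketch: the names shuffle_v8i16, trunc_via_shuffle
and MASK_UNDEF are made up for illustration, the lane order assumes a
little-endian target such as x86, and undef lanes are arbitrarily filled
with 0 (any value would be legal there).

```c
#include <assert.h>
#include <stdint.h>

#define MASK_UNDEF (-1) /* hypothetical marker for an undef mask entry */

/* Scalar model of shufflevector on two <8 x i16> operands:
   mask values 0..7 pick from a, 8..15 pick from b. */
static void shuffle_v8i16(uint16_t out[8], const uint16_t a[8],
                          const uint16_t b[8], const int mask[8])
{
    for (int i = 0; i < 8; ++i) {
        int m = mask[i];
        if (m == MASK_UNDEF)
            out[i] = 0; /* don't-care lane; 0 is an arbitrary choice */
        else
            out[i] = (m < 8) ? a[m] : b[m - 8];
    }
}

/* Mirrors the IR: bitcast to <8 x i16>, shuffle the even words down,
   then keep the low 64 bits as <4 x i16>. */
static void trunc_via_shuffle(uint16_t out[4], const uint32_t in[4])
{
    static const int mask[8] = { 0, 2, 4, 6, MASK_UNDEF, MASK_UNDEF,
                                 MASK_UNDEF, MASK_UNDEF };
    uint16_t words[8], undef_vec[8] = { 0 }, shuffled[8];

    /* bitcast <4 x i32> to <8 x i16>, little-endian lane order */
    for (int i = 0; i < 4; ++i) {
        words[2 * i]     = (uint16_t)(in[i] & 0xFFFFu);
        words[2 * i + 1] = (uint16_t)(in[i] >> 16);
    }

    shuffle_v8i16(shuffled, words, undef_vec, mask);

    /* extractelement <2 x i64> ..., 0 + bitcast: the low four words */
    for (int i = 0; i < 4; ++i)
        out[i] = shuffled[i];
}
```

Only the low four result lanes are ever read, so the upper four mask entries
are free, which is why they are undef in the IR.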

Unfortunately there's an access violation originating in
LowerVECTOR_SHUFFLEv8i16, due to the undefs in the shuffle mask. Using the
mask < 0, 2, 4, 6, 0, 2, 4, 6 > instead avoids the crash and gives me the
following x86 code:

  push        ebp  
  mov         ebp,esp 
  and         esp,0FFFFFFF0h 
  sub         esp,40h 
  movaps      xmm0,xmmword ptr ds:[48C7700h] 
  pextrw      eax,xmm0,4 
  movaps      xmm1,xmm0 
  punpcklqdq  xmm1,xmm1 
  pshuflw     xmm1,xmm1,88h 
  pshufhw     xmm1,xmm1,88h 
  pinsrw      xmm1,eax,2 
  pextrw      ecx,xmm0,6 
  pinsrw      xmm1,ecx,3 
  pinsrw      xmm1,eax,6 
  pinsrw      xmm1,ecx,7 
  movaps      xmmword ptr [esp],xmm1 
  mov         eax,dword ptr [esp+4] 
  mov         dword ptr [esp+1Ch],eax 
  mov         eax,dword ptr [esp] 
  mov         dword ptr [esp+18h],eax 
  movq        mm0,mmword ptr [esp+18h] 
  movq        mmword ptr ds:[48C76F8h],mm0 
  mov         esp,ebp 
  pop         ebp  
  ret           

This is far from optimal. Instead I expected something like:

  push        ebp  
  mov         ebp,esp 
  and         esp,0FFFFFFF0h 
  sub         esp,40h 
  movaps      xmm0,xmmword ptr ds:[48C7700h] 
  pshuflw     xmm0,xmm0,0x88
  pshufhw     xmm0,xmm0,0x88
  pshufd      xmm0,xmm0,0x88
  movdq2q     mm0,xmm0
  movq        mmword ptr ds:[48C76F8h],mm0 
  mov         esp,ebp 
  pop         ebp  
  ret           

Between the load and the final store, that's essentially 4 instructions
instead of 16!
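
As a sanity check on the expected sequence, here is a scalar C sketch of the
three shuffles with immediate 0x88. The function names (model_pshuflw,
model_pshufhw, model_pshufd, trunc_via_pshuf) are hypothetical, and the code
only models the documented lane selection of these instructions on a
little-endian layout; it does not use real SSE2 instructions.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* PSHUFLW model: permute the low four words by imm, keep the high four. */
static void model_pshuflw(uint16_t dst[8], const uint16_t src[8], uint8_t imm)
{
    uint16_t tmp[8];
    memcpy(tmp, src, sizeof tmp);
    for (int i = 0; i < 4; ++i)
        tmp[i] = src[(imm >> (2 * i)) & 3];
    memcpy(dst, tmp, sizeof tmp);
}

/* PSHUFHW model: permute the high four words by imm, keep the low four. */
static void model_pshufhw(uint16_t dst[8], const uint16_t src[8], uint8_t imm)
{
    uint16_t tmp[8];
    memcpy(tmp, src, sizeof tmp);
    for (int i = 0; i < 4; ++i)
        tmp[4 + i] = src[4 + ((imm >> (2 * i)) & 3)];
    memcpy(dst, tmp, sizeof tmp);
}

/* PSHUFD model: permute the four dwords (word pairs) by imm. */
static void model_pshufd(uint16_t dst[8], const uint16_t src[8], uint8_t imm)
{
    uint16_t tmp[8];
    for (int i = 0; i < 4; ++i) {
        int sel = (imm >> (2 * i)) & 3;
        tmp[2 * i]     = src[2 * sel];
        tmp[2 * i + 1] = src[2 * sel + 1];
    }
    memcpy(dst, tmp, sizeof tmp);
}

/* pshuflw + pshufhw + pshufd, all with 0x88, then take the low 64 bits
   (the part movdq2q would move into an MMX register). */
static void trunc_via_pshuf(uint16_t out[4], const uint32_t in[4])
{
    uint16_t w[8];
    for (int i = 0; i < 4; ++i) {
        w[2 * i]     = (uint16_t)(in[i] & 0xFFFFu);
        w[2 * i + 1] = (uint16_t)(in[i] >> 16);
    }
    model_pshuflw(w, w, 0x88); /* words 0,2,0,2 in the low half  */
    model_pshufhw(w, w, 0x88); /* words 4,6,4,6 in the high half */
    model_pshufd(w, w, 0x88);  /* dwords 0,2,0,2                 */
    memcpy(out, w, 4 * sizeof w[0]);
}
```

For the input { 0x11112222, 0x33334444, 0x55556666, 0x77778888 } this model
yields { 0x2222, 0x4444, 0x6666, 0x8888 }, i.e. the low 16 bits of each
element.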

Since a 'trunc' of <4 x i32> to <4 x i16> can be quite useful, I think it
might be worth having specialized codegen for this. Interestingly, the
pshuflw+pshufhw pair already gets generated, but after that the backend fails
to generate pshufd+movdq2q.
