[LLVMdev] Lowering to MMX

Thu Oct 27 12:16:26 PDT 2011

On 26/10/2011 6:35 PM, Bill Wendling wrote:
> On Oct 26, 2011, at 1:18 PM, Nicolas Capens wrote:
>
>> I'm having one remaining issue though: I can't seem to generate the movd instruction(s) (moving 32 bits of data in and out of the lower half of an MMX register). Take for example the following LLVM IR:
>>
>> define internal void @unpack(i8*, i8*) {
>>   %3 = bitcast i8* %1 to i32*
>>   %4 = load i32* %3, align 1
>>   %5 = insertelement <2 x i32> undef, i32 %4, i32 0
>>   %6 = bitcast <2 x i32> %5 to x86_mmx
>>   %7 = call x86_mmx @llvm.x86.mmx.punpcklbw(x86_mmx %6, x86_mmx %6)
>>   %8 = bitcast i8* %0 to x86_mmx*
>>   store x86_mmx %7, x86_mmx* %8, align 1
>>   ret void
>> }
>> declare x86_mmx @llvm.x86.mmx.punpcklbw(x86_mmx, x86_mmx) nounwind readnone
>>
>> Which gives me the following assembly code:
>>
>> push        ebp
>> mov         ebp,esp
>> and         esp,0FFFFFFF0h
>> sub         esp,20h
>> mov         eax,dword ptr [ebp+0Ch]
>> movd        xmm0,dword ptr [eax]
>> movapd      xmmword ptr [esp],xmm0
>> movq        mm0,mmword ptr [esp]
>> punpcklbw   mm0,mm0
>> mov         eax,dword ptr [ebp+8]
>> movq        mmword ptr [eax],mm0
>> emms
>> mov         esp,ebp
>> pop         ebp
>> ret
>>
>> The inner portion could look like this instead:
>>
>> movd        mm0,dword ptr [eax]
>> punpcklbw   mm0,mm0
>>
>> Should I be using other IR operations to get this result, or are the matching patterns missing? Or would it perhaps be best to make movd available as an intrinsic as well (note that it has four varieties for MMX)?
> I don't think it's a missing pattern. I think it's the backend trying to use the best instructions available. I get this if I turn off SSE:
>
> [Irk:llvm] llc -o - t.ll -mattr=-sse,+mmx -O3 -x86-asm-syntax=intel
> 	.section	__TEXT,__text,regular,pure_instructions
> 	.align	4, 0x90
> _unpack:                                ## @unpack
> Ltmp0:
> 	.cfi_startproc
> ## BB#0:
> 	mov	ECX, DWORD PTR [RSI]
> 	shl	RAX, 32
> 	or	RAX, RCX
> 	movd	MM0, RAX
> 	punpcklbw	MM0, MM0
> 	movq	QWORD PTR [RDI], MM0
> 	ret
> Ltmp1:
> 	.cfi_endproc
> Leh_func_end0:
>

That's interesting. It means that somewhere along the way the v2i32
insert gets promoted to a v4i32 insert because that is assumed to be
better. Perhaps it could be detected that the result is consumed by an
MMX intrinsic (after the bitcast), so that this promotion isn't
performed. Do you happen to know which parts of the code I could look
through to change this behavior?
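
For reference, here is a rough scalar sketch in plain C of what @unpack
is meant to compute, so that any change to the lowering can be checked
against it. The names punpcklbw_self and unpack_ref are made up for
this illustration and are not part of LLVM; the code only models the
result of movd followed by punpcklbw mm0,mm0 and says nothing about
which instructions the backend should select.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: models punpcklbw with both operands equal.
 * Each of the four low bytes is duplicated into two adjacent bytes
 * of the 64-bit result (a0,a0,a1,a1,a2,a2,a3,a3). */
uint64_t punpcklbw_self(uint32_t low)
{
    uint64_t result = 0;
    for (int i = 0; i < 4; i++) {
        uint64_t byte = (low >> (8 * i)) & 0xFF;
        result |= byte << (16 * i);      /* even byte position */
        result |= byte << (16 * i + 8);  /* odd byte position  */
    }
    return result;
}

/* Hypothetical scalar equivalent of @unpack: read 4 bytes from src,
 * write 8 bytes to dst. memcpy is used because the IR loads and
 * stores with align 1. */
void unpack_ref(uint8_t *dst, const uint8_t *src)
{
    uint32_t low;
    uint64_t wide;
    memcpy(&low, src, sizeof low);    /* the 32-bit load (movd)   */
    wide = punpcklbw_self(low);       /* the byte interleave      */
    memcpy(dst, &wide, sizeof wide);  /* the 64-bit store (movq)  */
}
```

Calling unpack_ref(dst, src) with a 4-byte src and an 8-byte dst should
leave each source byte stored twice in a row in dst, which is what the
two-instruction movd/punpcklbw sequence would produce.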

Thanks,
Nicolas