[LLVMdev] Lowering to MMX

Wed Oct 26 15:35:32 PDT 2011

On Oct 26, 2011, at 1:18 PM, Nicolas Capens wrote:

> On 24/10/2011 9:50 PM, Bill Wendling wrote:
>> On Oct 20, 2011, at 8:42 AM, Nicolas Capens wrote:
>> 
>>> Hi all,
>>> 
>>> I'm working on a graphics project which uses LLVM for dynamic code
>>> generation, and I noticed a major performance regression when upgrading
>>> from LLVM 2.8 to 3.0-rc1 (LLVM 2.9 didn't support Win64 so I skipped it
>>> entirely).
>>> 
>>> I found out that the performance regression is due to removing support
>>> for lowering 64-bit vector operations to MMX, and using SSE2 instead. My
>>> code uses a mix of MMX intrinsics and v4i16 operations, so it ping-pongs
>>> back and forth between MMX and SSE2 instructions in the generated code.
>>> 
>>> To get more optimal code, I see three options, and I was wondering if
>>> someone could share some advice on which approach you think will work best:
>>> 1) I could use v8i16 or v4i32 instead of v4i16, but then the SSE
>>> register pressure would be significantly increased. I already use v4f32
>>> operations intensively so having the MMX registers available for 64-bit
>>> integer vector operations helps performance quite considerably on the
>>> register deprived x86 architecture. There's little to no opportunity for
>>> using v8i16 to perform two v4i16 operations simultaneously so that won't
>>> make up for the added register pressure. So I'm not keen to implement
>>> this option, unless anyone sees some advantages that I missed?
>> It's my understanding that SSE is by far superior to MMX for a number of reasons, not the least of which is the need to use the expensive EMMS instruction. Instead of guessing about the performance impact, I would encourage you to test this out.
> I'm already explicitly using EMMS, where necessary. Basically when avoiding x87 (and avoiding library calls which may use x87), it's not needed. So there's no performance drawback for using MMX in my case.
> 
> I've verified that combining MMX and SSE2 is significantly faster than using SSE2 alone. It basically gives me access to 8 more registers for 64-bit integer vector operations, leaving the SSE registers available for floating-point and wider integer operations. Upgrading from LLVM 2.8 to 3.0 degraded performance by 30%, and a quick look at the assembly made it clear that register pressure is a major issue.

Okay. Cool.

>>> 3) I believe all MMX instructions are available as intrinsics now? That
>>> would allow me to replace all straight LLVM operations with intrinsics.
>>> I'm just wondering what the downsides of that would be? I assume I won't
>>> get any benefits from instruction combining, but things like dead code
>>> elimination still work?
>> Intrinsics are the only way to go if you want MMX code. We do as much as we can, but to be honest optimizing for MMX is not a high priority for us.
> I fully understand that having LLVM insert EMMS instructions and trying to prevent it from degrading performance just wasn't worthwhile.

For what it's worth, LLVM doesn't insert EMMS instructions for you.

> Fortunately explicit use of MMX intrinsics is fine for my use.

Great!

> I'm having one remaining issue though; I can't seem to generate the movd instruction(s) (moving 32 bits of data in and out of the lower half of an MMX register). Take for example the following LLVM IR:
> 
> define internal void @unpack(i8*, i8*) {
>  %3 = bitcast i8* %1 to i32*
>  %4 = load i32* %3, align 1
>  %5 = insertelement <2 x i32> undef, i32 %4, i32 0
>  %6 = bitcast <2 x i32> %5 to x86_mmx
>  %7 = call x86_mmx @llvm.x86.mmx.punpcklbw(x86_mmx %6, x86_mmx %6)
>  %8 = bitcast i8* %0 to x86_mmx*
>  store x86_mmx %7, x86_mmx* %8, align 1
>  ret void
> }
> declare x86_mmx @llvm.x86.mmx.punpcklbw(x86_mmx, x86_mmx) nounwind readnone
> 
> Which gives me the following assembly code:
> 
> push        ebp
> mov         ebp,esp
> and         esp,0FFFFFFF0h
> sub         esp,20h
> mov         eax,dword ptr [ebp+0Ch]
> movd        xmm0,dword ptr [eax]
> movapd      xmmword ptr [esp],xmm0
> movq        mm0,mmword ptr [esp]
> punpcklbw   mm0,mm0
> mov         eax,dword ptr [ebp+8]
> movq        mmword ptr [eax],mm0
> emms
> mov         esp,ebp
> pop         ebp
> ret
> 
> The inner portion could look like this instead:
> 
> movd        mm0,dword ptr [eax]
> punpcklbw   mm0,mm0
> 
> Should I be using other IR operations to get this result, or are the matching patterns missing? Or would it perhaps be best to make movd available as an intrinsic as well (note that it has four varieties for MMX)?

I don't think it's a missing pattern. I think it's the backend trying to use the best instructions available. I get this if I turn off SSE:

[Irk:llvm] llc -o - t.ll -mattr=-sse,+mmx -O3 -x86-asm-syntax=intel
	.section	__TEXT,__text,regular,pure_instructions
	.align	4, 0x90
_unpack:                                ## @unpack
Ltmp0:
	.cfi_startproc
## BB#0:
	mov	ECX, DWORD PTR [RSI]
	shl	RAX, 32
	or	RAX, RCX
	movd	MM0, RAX
	punpcklbw	MM0, MM0
	movq	QWORD PTR [RDI], MM0
	ret
Ltmp1:
	.cfi_endproc
Leh_func_end0:

I don't know if this will work for you at all. Of course, one other option is to use inline assembly... (That's one of the first times I've ever suggested that. :-) )
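
Whichever route is taken, it may help to have something to check the generated code against. Below is a rough sketch in C, not taken from anyone's actual code: unpack_ref is a hypothetical portable model of what @unpack computes (load 4 bytes, duplicate each one as punpcklbw mm0,mm0 does, store 8 bytes), and unpack_mmx is a hypothetical equivalent written with the standard <mmintrin.h> intrinsics. It assumes a GCC-compatible compiler with MMX enabled; whether the compiler really emits "movd mm0, ..." for _mm_cvtsi32_si64 is not guaranteed and would need to be confirmed in the disassembly.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical portable reference for @unpack: read 4 bytes from src,
   duplicate each byte (the effect of punpcklbw mm0, mm0 on the low
   32 bits), and write the resulting 8 bytes to dst. */
void unpack_ref(uint8_t *dst, const uint8_t *src)
{
    uint8_t in[4];
    uint8_t out[8];
    int i;

    memcpy(in, src, sizeof in);          /* unaligned 32-bit load */
    for (i = 0; i < 4; ++i) {
        out[2 * i]     = in[i];          /* byte taken from first operand  */
        out[2 * i + 1] = in[i];          /* same byte from second operand  */
    }
    memcpy(dst, out, sizeof out);        /* unaligned 64-bit store */
}

#if defined(__MMX__) && defined(__GNUC__)
#include <mmintrin.h>

/* Hypothetical MMX version using compiler intrinsics. The comments name
   the instruction each intrinsic corresponds to; the compiler is free
   to pick a different encoding. */
void unpack_mmx(uint8_t *dst, const uint8_t *src)
{
    int lo;
    __m64 v;

    memcpy(&lo, src, sizeof lo);         /* unaligned 32-bit load  */
    v = _mm_cvtsi32_si64(lo);            /* movd: zero-extend into 64 bits */
    v = _mm_unpacklo_pi8(v, v);          /* punpcklbw */
    memcpy(dst, &v, sizeof v);           /* unaligned 64-bit store */
    _mm_empty();                         /* emms, before any x87 use */
}
#endif
```

Feeding the same input to both functions, and to the JIT-compiled @unpack, should give identical 8-byte results; for the bytes 1,2,3,4 that is 1,1,2,2,3,3,4,4.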

-bw