[LLVMdev] How does SSEDomainFix work?

Jakob Stoklund Olesen stoklund at 2pi.dk
Tue May 11 08:08:53 PDT 2010


On May 10, 2010, at 9:07 PM, NAKAMURA Takumi wrote:

> Hello. This is my 1st post.

Welcome!

> I have tried SSE execution domain fixup pass.
> But I am not able to see any improvements.

Did you actually measure runtime, or did you look at assembly?

> I expect for the example below to use MOVDQA, PAND &c.
> (On nehalem, ANDPS is extremely slower than PAND)

Are you sure? The andps and pand instructions are actually the same speed, but on Nehalem there is a latency penalty for moving data between the int and float domains.

The SSE execution domain pass tries to minimize the extra latency by switching instructions.

In your example, all the operations are available as either int or float instructions. The instruction selector chooses the float instructions because their encodings are one byte shorter. The SSE execution domain pass does not change them because there are no domain crossings and therefore no extra latency. Everything takes place in the float domain, which is just as fast.

If you use operations that are only available in one domain, the SSE execution domain pass kicks in:

define <4 x i32> @intfoo(<4 x i32> %x, <4 x i32> %y, <4 x i32> %z)
nounwind readnone {
entry:
 %0 = add <4 x i32> %x, %z
 %not = xor <4 x i32> %z, <i32 -1, i32 -1, i32 -1, i32 -1>
 %1 = and <4 x i32> %not, %y
 %2 = xor <4 x i32> %0, %1
 ret <4 x i32> %2
}

_intfoo:
	movdqa	%xmm0, %xmm3
	paddd	%xmm2, %xmm3
	pandn	%xmm1, %xmm2
	movdqa	%xmm2, %xmm0
	pxor	%xmm3, %xmm0
	ret

All the instructions moved to the int domain because the integer add (paddd) exists only in that domain and forced the others to follow.
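
To make the behaviour above concrete, here is a toy model in Python. It is only a sketch of the idea, not how the pass is implemented in LLVM: the ENCODINGS table and the assign_domain function are hypothetical names made up for illustration, and the model assumes a single chain of dependent operations.

```python
# Toy model of domain selection; NOT LLVM's actual implementation.
INT, FLOAT = "int", "float"

# Hypothetical table: abstract operation -> {domain: mnemonic}.
ENCODINGS = {
    "mov":  {FLOAT: "movaps", INT: "movdqa"},
    "and":  {FLOAT: "andps",  INT: "pand"},
    "andn": {FLOAT: "andnps", INT: "pandn"},
    "xor":  {FLOAT: "xorps",  INT: "pxor"},
    "add":  {INT: "paddd"},  # <4 x i32> add exists only in the int domain
}

def assign_domain(ops, default=FLOAT):
    """Pick one domain for a chain of dependent operations.

    Stays in the default (float) domain when every operation allows it,
    otherwise moves the whole chain to the domain all operations share.
    """
    common = {INT, FLOAT}
    for op in ops:
        common &= set(ENCODINGS[op])
    if not common:
        raise ValueError("no single domain fits; a crossing is unavoidable")
    domain = default if default in common else next(iter(common))
    return [ENCODINGS[op][domain] for op in ops]

foo = ["mov", "and", "andn", "mov", "xor"]
intfoo = ["mov", "add", "andn", "mov", "xor"]
print(assign_domain(foo))     # stays in the float domain
print(assign_domain(intfoo))  # the add pulls everything into the int domain
```

With these inputs the model reproduces the two listings in this thread: movaps/andps/andnps/xorps for foo, and movdqa/paddd/pandn/pxor for intfoo.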

> Please tell me if something would be wrong for me.

You should measure if LLVM's code is actually slower than the code you want. If it is, I would like to hear about it.

Our weakness is the shufflevector instruction. It is selected into shufps/pshufd/palignr/... only by looking at patterns. The instruction selector does not consider execution domains. This can be a problem because these instructions cannot be freely interchanged by the SSE execution domain pass.
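
A rough way to picture the shuffle problem is to count domain crossings along a chain of dependent instructions. The Python below is a toy sketch, not part of LLVM: the DOMAIN table and count_crossings are hypothetical names, and the chains are made-up examples in which each instruction consumes the previous result.

```python
# Toy model only: counts int/float domain crossings along def-use edges.
# Hypothetical table mapping a few mnemonics to their execution domain.
DOMAIN = {
    "addps": "float", "mulps": "float", "shufps": "float",
    "paddd": "int", "pshufd": "int",
}

def count_crossings(chain):
    """Count adjacent producer/consumer pairs that sit in different domains."""
    return sum(DOMAIN[a] != DOMAIN[b] for a, b in zip(chain, chain[1:]))

# An int-domain shuffle between two float operations crosses twice.
print(count_crossings(["addps", "pshufd", "mulps"]))  # 2
# A float-domain shuffle in the same position does not cross at all.
print(count_crossings(["addps", "shufps", "mulps"]))  # 0
```

Since the pass cannot swap one shuffle for another, a chain like the first one keeps both crossings unless the instruction selector picks differently.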


> foo.ll:
> define <4 x i32> @foo(<4 x i32> %x, <4 x i32> %y, <4 x i32> %z)
> nounwind readnone {
> entry:
>  %0 = and <4 x i32> %x, %z
>  %not = xor <4 x i32> %z, <i32 -1, i32 -1, i32 -1, i32 -1>
>  %1 = and <4 x i32> %not, %y
>  %2 = xor <4 x i32> %0, %1
>  ret <4 x i32> %2
> }
> $ llc -mcpu=nehalem -debug-pass=Structure foo.bc -o foo.s
> (snip)
>    Code Placement Optimizater
>    SSE execution domain fixup
>    Machine Natural Loop Construction
>    X86 AT&T-Style Assembly Printer
>    Delete Garbage Collector Information
> 
> foo.s: (edited)
> _foo:
> 	movaps	%xmm0, %xmm3
> 	andps	%xmm2, %xmm3
> 	andnps	%xmm1, %xmm2
> 	movaps	%xmm2, %xmm0
> 	xorps	%xmm3, %xmm0
> 	ret




