[LLVMdev] How does SSEDomainFix work?

Tue May 11 10:05:49 PDT 2010

Dear Jakob-san,

> ようこそ！ (Welcome!)

:D

Thank you for your reply. First, I have to apologize.
I misunderstood the aim of SSEDomainFix.
Now I see what the pass does.

Anyway, the point I would like to raise is throughput
rather than (inter-domain) latency.
By my measurement, FP ops are roughly 3x slower than SIMD-integer (SI) ops on Nehalem.
I think SI ops should be preferred on Nehalem (and on generic SSE2).
(The shorter instructions could still be chosen with -Os.)

The attachment contains a simple (and admittedly crude) asm-in-C source file and a
Win32 executable.
$ mingw32-gcc -msse2 -O4 -Wall -funroll-all-loops foo.c
It should also compile on other x86 hosts,
but you will need to pin the process to a single processor ; )
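
Since the archive scrubs the attachment, here is a minimal sketch of this kind of measurement. It is not the attached foo.c. The names time_pxor, time_xorps, FOUR_OPS and DEFINE_TIMER are made up for illustration, and it assumes GCC or Clang on an x86 target with SSE2. It reads the TSC, which may tick at a different rate than the core clock, so only the ratio between the two timers is meaningful.

```c
#include <stdint.h>

#if defined(__SSE2__) && defined(__GNUC__)
#include <x86intrin.h>              /* __rdtsc() */
#define HAVE_SSE2_ASM 1

/* Four register-to-register ops on four separate destination registers,
 * so the loop is limited by issue throughput rather than by latency. */
#define FOUR_OPS(op)                \
    op " %%xmm4, %%xmm0\n\t"        \
    op " %%xmm5, %%xmm1\n\t"        \
    op " %%xmm6, %%xmm2\n\t"        \
    op " %%xmm7, %%xmm3\n\t"

/* Defines a function that runs 16 copies of `op` per loop iteration
 * (hand-unrolled to dilute loop overhead) and returns elapsed TSC ticks. */
#define DEFINE_TIMER(name, op)                                      \
    uint64_t name(long iters) {                                     \
        uint64_t start = __rdtsc();                                 \
        for (long i = 0; i < iters; ++i)                            \
            __asm__ volatile(FOUR_OPS(op) FOUR_OPS(op)              \
                             FOUR_OPS(op) FOUR_OPS(op)              \
                             ::: "xmm0", "xmm1", "xmm2", "xmm3");   \
        return __rdtsc() - start;                                   \
    }

DEFINE_TIMER(time_pxor,  "pxor")    /* SI-domain XOR */
DEFINE_TIMER(time_xorps, "xorps")   /* FP-domain XOR */

#else
/* Fallback so the file still compiles on targets without SSE2 inline asm. */
#define HAVE_SSE2_ASM 0
uint64_t time_pxor(long iters)  { (void)iters; return 0; }
uint64_t time_xorps(long iters) { (void)iters; return 0; }
#endif
```

Calling time_xorps(1000000) and time_pxor(1000000) from main() with the process pinned to one core, and dividing the first result by the second, gives the kind of ratio discussed here. Build with optimization enabled so the loop itself stays small.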

The counts below are cycles per million iterations on a Core i7:
982270 xorps
982231 movaps
371671 pxor
342628 movdqa

SI ops can issue 3 per cycle, but FP ops only 1 per cycle.
(As we know, they are nearly the same on Conroe and Penryn.)
Sorry, loads via movdqa and movaps were not measured. : (
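
As a sanity check on the 3x figure, the ratios can be worked out directly from the counts above. This is a small sketch in C; the function and constant names are made up for illustration, and the only data used are the four measured counts.

```c
#include <stdio.h>

/* Cycles per one million iterations, copied from the measurements above. */
static const double XORPS_CYCLES  = 982270.0;
static const double MOVAPS_CYCLES = 982231.0;
static const double PXOR_CYCLES   = 371671.0;
static const double MOVDQA_CYCLES = 342628.0;

/* Convert a per-million-iterations count into cycles per iteration. */
double cycles_per_iteration(double cycles_per_million) {
    return cycles_per_million / 1e6;
}

/* How many times slower the FP-domain form is than the SI-domain form. */
double fp_to_si_ratio(double fp_cycles, double si_cycles) {
    return fp_cycles / si_cycles;
}

/* Print cycles per iteration and the FP/SI ratio for both instruction pairs. */
void print_summary(void) {
    printf("xorps  %.3f vs pxor   %.3f cycles/iter, ratio %.2f\n",
           cycles_per_iteration(XORPS_CYCLES),
           cycles_per_iteration(PXOR_CYCLES),
           fp_to_si_ratio(XORPS_CYCLES, PXOR_CYCLES));
    printf("movaps %.3f vs movdqa %.3f cycles/iter, ratio %.2f\n",
           cycles_per_iteration(MOVAPS_CYCLES),
           cycles_per_iteration(MOVDQA_CYCLES),
           fp_to_si_ratio(MOVAPS_CYCLES, MOVDQA_CYCLES));
}
```

The xorps/pxor ratio comes out a little above 2.6 and the movaps/movdqa ratio a little under 2.9, so "3x" is a rounded-up figure.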

See also:
- Intel optimization manual
  http://www.intel.com/assets/pdf/manual/248966.pdf
- Agner's works
  http://agner.org/optimize/

Thank you,
Takumi

2010/5/12 Jakob Stoklund Olesen <stoklund at 2pi.dk>:
>
> On May 10, 2010, at 9:07 PM, NAKAMURA Takumi wrote:
>
>> Hello. This is my first post.
>
> ようこそ！ (Welcome!)
>
>> I have tried the SSE execution domain fixup pass,
>> but I am not able to see any improvement.
>
> Did you actually measure runtime, or did you look at assembly?
>
>> I expected the example below to use MOVDQA, PAND, etc.
>> (On Nehalem, ANDPS is much slower than PAND)
>
> Are you sure? The andps and pand instructions are actually the same speed, but on Nehalem there is a latency penalty for moving data between the int and float domains.
>
> The SSE execution domain pass tries to minimize the extra latency by switching instructions.
>
> In your examples, all the operations are available as either int or float instructions. The instruction selector chooses the float instructions because they are smaller. The SSE execution domain pass does not change them because there are zero domain crossings, zero extra latency. Everything takes place in the float domain which is just as fast.
>
> If you use operations that are only available in one domain, the SSE execution domain pass kicks in:
>
> define <4 x i32> @intfoo(<4 x i32> %x, <4 x i32> %y, <4 x i32> %z)
> nounwind readnone {
> entry:
>  %0 = add <4 x i32> %x, %z
>  %not = xor <4 x i32> %z, <i32 -1, i32 -1, i32 -1, i32 -1>
>  %1 = and <4 x i32> %not, %y
>  %2 = xor <4 x i32> %0, %1
>  ret <4 x i32> %2
> }
>
> _intfoo:
>        movdqa  %xmm0, %xmm3
>        paddd   %xmm2, %xmm3
>        pandn   %xmm1, %xmm2
>        movdqa  %xmm2, %xmm0
>        pxor    %xmm3, %xmm0
>        ret
>
> All the instructions moved to the int domain because the add forced them.
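
To make the behaviour described above concrete, here is a toy model in C. It is not LLVM's implementation: it treats the function as a straight-line chain in which each instruction feeds the next, counts one crossing whenever two neighbours are in different domains, and picks the assignment with the fewest crossings, preferring the float domain on a tie. All names (assign_domains, enum constraint, MAX_INSNS) are invented for this sketch.

```c
#include <stddef.h>

enum domain     { DOM_FLOAT = 0, DOM_INT = 1 };
enum constraint { ONLY_FLOAT, ONLY_INT, EITHER };

#define MAX_INSNS 64
#define INF       1000000

/* Can an instruction with constraint c be emitted in domain d? */
int allowed(enum constraint c, int d) {
    return c == EITHER ||
           (c == ONLY_FLOAT && d == DOM_FLOAT) ||
           (c == ONLY_INT   && d == DOM_INT);
}

/* Choose a domain for each of n chained instructions so that the number of
 * domain crossings is minimal. Writes the choice to out[] and returns the
 * number of crossings, or -1 if n is 0 or larger than MAX_INSNS. */
int assign_domains(const enum constraint *c, size_t n, enum domain *out) {
    int cost[MAX_INSNS][2];   /* fewest crossings with insn i in domain d */
    int prev[MAX_INSNS][2];   /* domain of insn i-1 on that best path     */
    if (n == 0 || n > MAX_INSNS)
        return -1;
    for (int d = 0; d < 2; ++d) {
        cost[0][d] = allowed(c[0], d) ? 0 : INF;
        prev[0][d] = d;
    }
    for (size_t i = 1; i < n; ++i) {
        for (int d = 0; d < 2; ++d) {
            int stay  = cost[i - 1][d];          /* same domain as before */
            int cross = cost[i - 1][1 - d] + 1;  /* pay for one crossing  */
            if (!allowed(c[i], d)) {
                cost[i][d] = INF;
                prev[i][d] = d;
            } else if (stay <= cross) {
                cost[i][d] = stay;
                prev[i][d] = d;
            } else {
                cost[i][d] = cross;
                prev[i][d] = 1 - d;
            }
        }
    }
    /* On a tie, keep the float domain (assumed default: smaller encodings). */
    int d = (cost[n - 1][DOM_FLOAT] <= cost[n - 1][DOM_INT]) ? DOM_FLOAT : DOM_INT;
    int best = cost[n - 1][d];
    for (size_t i = n; i-- > 0; ) {              /* walk the best path back */
        out[i] = (enum domain)d;
        d = prev[i][d];
    }
    return best;
}
```

With five EITHER instructions (the @foo case) every instruction stays in the float domain. Marking the second one ONLY_INT (the paddd in @intfoo) moves all five to the int domain. Both cases end with zero crossings.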
>
>> Please tell me if I am doing something wrong.
>
> You should measure if LLVM's code is actually slower than the code you want. If it is, I would like to hear.
>
> Our weakness is the shufflevector instruction. It is selected into shufps/pshufd/palign/... only by looking at patterns. The instruction selector does not consider execution domains. This can be a problem because these instructions cannot be freely interchanged by the SSE execution domain pass.
>
>
>> foo.ll:
>> define <4 x i32> @foo(<4 x i32> %x, <4 x i32> %y, <4 x i32> %z)
>> nounwind readnone {
>> entry:
>>  %0 = and <4 x i32> %x, %z
>>  %not = xor <4 x i32> %z, <i32 -1, i32 -1, i32 -1, i32 -1>
>>  %1 = and <4 x i32> %not, %y
>>  %2 = xor <4 x i32> %0, %1
>>  ret <4 x i32> %2
>> }
>> $ llc -mcpu=nehalem -debug-pass=Structure foo.bc -o foo.s
>> (snip)
>>    Code Placement Optimizater
>>    SSE execution domain fixup
>>    Machine Natural Loop Construction
>>    X86 AT&T-Style Assembly Printer
>>    Delete Garbage Collector Information
>>
>> foo.s: (edited)
>> _foo:
>>       movaps  %xmm0, %xmm3
>>       andps   %xmm2, %xmm3
>>       andnps  %xmm1, %xmm2
>>       movaps  %xmm2, %xmm0
>>       xorps   %xmm3, %xmm0
>>       ret
>
>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: xmm.zip
Type: application/zip
Size: 2911 bytes
Desc: not available
URL: <http://lists.llvm.org/pipermail/llvm-dev/attachments/20100512/d9715376/attachment.zip>