[LLVMdev] -msse3 can degrade performance

Wed Feb 4 11:15:11 PST 2009

On Feb 2, 2009, at 3:00 PM, Jon Harrop wrote:
> On Monday 02 February 2009 20:37:47 you wrote:
>> On Feb 2, 2009, at 12:39 PM, Jon Harrop wrote:
>>> On Monday 02 February 2009 06:10:26 Chris Lattner wrote:
>>>> I'm seeing exactly identical .s files with -msse2 and -msse3 on the
>>>> scimark version I have.  Can you please send the output of:
>>>>
>>>> llvm-gcc -O3 MonteCarlo.c -S -msse2 -o MonteCarlo.2.s
>>>> llvm-gcc -O3 MonteCarlo.c -S -msse3 -o MonteCarlo.3.s
>>>>
>>>> llvm-gcc -O3 MonteCarlo.c -S -msse2 -o MonteCarlo.2.ll -emit-llvm
>>>> llvm-gcc -O3 MonteCarlo.c -S -msse3 -o MonteCarlo.3.ll -emit-llvm
>>>
>>> Can I just check that you had noticed that my timings for those
>>> (sse2 vs sse3) were the same and that the difference was occurring
>>> between -msse and -msse2 (see below)?
>>
> The x86 output is attached for those (which give the same results here
> too) as well as -O3 and -O3 -msse which give different results here.
> Here are the performance results I just got when redoing this on x86:
>
> MonteCarlo:     Mflops:   212.20   -O3
> MonteCarlo:     Mflops:   211.37   -O3 -msse
> MonteCarlo:     Mflops:   123.70   -O3 -msse2
> MonteCarlo:     Mflops:   127.22   -O3 -msse3

Ok, thanks Jon!  I diff'd the files and the -msse2 and -msse3 code is  
identical, so we're not doing anything wrong with -msse3 :).

OTOH, the perf drop from sse -> sse2 is concerning.  The difference
here is that with -msse2 we do double math in SSE regs instead of
FPStack regs.  In this case, using the fp stack avoids some cross-class
register copying.  We could improve the code generator to notice and
handle this; I added a note to the x86 backend with some details:
http://lists.cs.uiuc.edu/pipermail/llvm-commits/Week-of-Mon-20090202/073254.html

This is a long-known issue, but a great example of it.

> Two other points of interest:
>
> . I just retimed in x64 and could not reproduce the difference so this
> only afflicts x86 and not x64 as I had said previously.

Right, this occurs because of the x86-32 ABI.  x86-64 should not be  
affected.
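
As a rough illustration of where the copies come from: on x86-32 a
function returning double hands the value back in st(0) on the x87
stack, while on x86-64 it comes back in xmm0.  If the 32-bit caller does
its double math in SSE registers (-msse2), the result has to be stored
to memory and reloaded into an xmm register, since there is no direct
move between the two register classes.  The sketch below has the same
general shape as a Monte Carlo kernel but is not the SciMark source:
Rng, rng_next_double and estimate_pi are made-up names, the generator is
a plain LCG picked for brevity, and noinline (a GCC/Clang attribute)
only stands in for the callee living in a different .c file.

```c
/* Hypothetical stand-in for a generator defined in another file. */
typedef struct { unsigned int seed; } Rng;

/* noinline keeps the call (and its double return value) visible,
   which is the situation the x86-32 ABI makes expensive. */
__attribute__((noinline))
static double rng_next_double(Rng *r)
{
    /* Simple 32-bit LCG; the top 24 bits are scaled into [0, 1). */
    r->seed = r->seed * 1664525u + 1013904223u;
    return (double)(r->seed >> 8) / 16777216.0;
}

/* Counts random points that fall inside the unit quarter circle. */
static double estimate_pi(Rng *r, int samples)
{
    int under_curve = 0;
    int i;
    for (i = 0; i < samples; i++) {
        double x = rng_next_double(r);  /* comes back in st(0) on x86-32 */
        double y = rng_next_double(r);
        if (x * x + y * y <= 1.0)       /* the double math on the results */
            under_curve++;
    }
    return 4.0 * under_curve / samples;
}
```

Compiling something like this with -O3 -m32 -S, once with -msse and once
with -msse2, and diffing the two .s files around the call is one way to
look for the extra store/reload; the exact code will depend on the
compiler version.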

> . Pulling the whole benchmark into a single compilation unit changes
> the performance results completely (still x86):
>
> $ llvm-gcc -O3 -msse3 -lm all.c -o all
> $ ./all
> Composite Score:          570.07
> FFT             Mflops:   599.40    (N=1024)
> SOR             Mflops:   476.97    (100 x 100)
> MonteCarlo:     Mflops:   278.17
> Sparse matmult  Mflops:   582.54    (N=1000, nz=5000)
> LU              Mflops:   913.27    (M=100, N=100)
> $ gcc -O3 -msse3 -lm all.c -o all
> $ ./all
> Composite Score:          539.20
> FFT             Mflops:   516.05    (N=1024)
> SOR             Mflops:   472.29    (100 x 100)
> MonteCarlo:     Mflops:   167.25
> Sparse matmult  Mflops:   633.20    (N=1000, nz=5000)
> LU              Mflops:   907.20    (M=100, N=100)
>
> Note that llvm-gcc is achieving almost 280MFLOPS on MonteCarlo here,
> far higher than any competitors, and it is outperforming gcc overall.

Great!  Do you see the same results with LTO?  Inlining  
Random_nextDouble from random.c to MonteCarlo.c should be a big win.
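
For comparison, here is a minimal sketch of what the inlinable form
looks like when the generator is visible in the caller's translation
unit, which is what the single-file build gives you and what LTO should
recover across files.  Gen, gen_next and mean_of_samples are
hypothetical names and the LCG is only a placeholder, not SciMark's
generator.  Once the call is inlined there is no return value crossing
a call boundary, so the value can stay in one register class.

```c
/* Hypothetical generator state; not the SciMark Random type. */
typedef struct { unsigned int state; } Gen;

/* static inline: the definition is visible to the caller below,
   so the optimizer is free to inline it into the loop. */
static inline double gen_next(Gen *g)
{
    /* Placeholder 32-bit LCG; top 24 bits scaled into [0, 1). */
    g->state = g->state * 1664525u + 1013904223u;
    return (double)(g->state >> 8) / 16777216.0;
}

/* Averages n samples; after inlining, the loop contains no call. */
static double mean_of_samples(Gen *g, int n)
{
    double sum = 0.0;
    int i;
    for (i = 0; i < n; i++)
        sum += gen_next(g);
    return sum / n;
}
```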

-Chris