[LLVMdev] -msse3 can degrade performance
Chris Lattner
clattner at apple.com
Wed Feb 4 11:15:11 PST 2009
On Feb 2, 2009, at 3:00 PM, Jon Harrop wrote:
> On Monday 02 February 2009 20:37:47 you wrote:
>> On Feb 2, 2009, at 12:39 PM, Jon Harrop wrote:
>>> On Monday 02 February 2009 06:10:26 Chris Lattner wrote:
>>>> I'm seeing exactly identical .s files with -msse2 and -msse3 on the
>>>> scimark version I have. Can you please send the output of:
>>>>
>>>> llvm-gcc -O3 MonteCarlo.c -S -msse2 -o MonteCarlo.2.s
>>>> llvm-gcc -O3 MonteCarlo.c -S -msse3 -o MonteCarlo.3.s
>>>>
>>>> llvm-gcc -O3 MonteCarlo.c -S -msse2 -o MonteCarlo.2.ll -emit-llvm
>>>> llvm-gcc -O3 MonteCarlo.c -S -msse3 -o MonteCarlo.3.ll -emit-llvm
>>>
>>> Can I just check that you had noticed that my timings for those (sse2 vs
>>> sse3) were the same and that the difference was occurring between -msse
>>> and -msse2 (see below)?
>>
> The x86 output is attached for those (which give the same results here too)
> as well as -O3 and -O3 -msse which give different results here. Here are
> the performance results I just got when redoing this on x86:
>
> MonteCarlo: Mflops: 212.20 -O3
> MonteCarlo: Mflops: 211.37 -O3 -msse
> MonteCarlo: Mflops: 123.70 -O3 -msse2
> MonteCarlo: Mflops: 127.22 -O3 -msse3

Ok, thanks Jon! I diff'd the files and the -msse2 and -msse3 code is
identical, so we're not doing anything wrong with -msse3 :).

OTOH, the perf drop from sse -> sse2 is concerning. The difference
here is that with -msse2 we do double math in SSE regs instead of
FPStack regs. In this case, using the fp stack avoids some cross-class
register copying. We could improve the code generator to notice and
handle this; I added a note to the x86 backend with some details:

http://lists.cs.uiuc.edu/pipermail/llvm-commits/Week-of-Mon-20090202/073254.html

This is a long-known issue, but this is a great example of it.

> Two other points of interest:
>
> . I just retimed in x64 and could not reproduce the difference so this only
> afflicts x86 and not x64 as I had said previously.
Right, this occurs because of the x86-32 ABI. x86-64 should not be
affected.
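
To make the ABI point concrete, here is a minimal sketch of the pattern, not the scimark code itself. The names next_value and accumulate are made up for illustration, and the noinline attribute (a GCC/Clang extension) only stands in for the callee living in another file. Under the x86-32 calling convention a double return value comes back in x87 st(0), so when the caller does its double math in SSE registers each result has to be moved across register classes before it can be used.

```c
#include <assert.h>

/* Hypothetical stand-in for a routine defined in another source file.
   noinline keeps the call boundary visible, as separate compilation would. */
__attribute__((noinline))
double next_value(double x)
{
    return x * 0.5 + 0.25;
}

/* Caller that does double arithmetic on every returned value. */
double accumulate(int n)
{
    assert(n >= 0);
    double sum = 0.0;
    double x = 1.0;
    for (int i = 0; i < n; i++) {
        x = next_value(x);  /* x86-32 ABI: the result is returned in x87 st(0) */
        sum += x * x;       /* with SSE2 math this is done in SSE registers */
    }
    return sum;
}
```

Compiling a file like this for x86-32 with -S, once with -msse and once with -msse2 as in the commands earlier in the thread, and diffing the two .s files should show where the extra copies around the call come from. The exact instructions will depend on the compiler version.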
> . Pulling the whole benchmark into a single compilation unit changes the
> performance results completely (still x86):
>
> $ llvm-gcc -O3 -msse3 -lm all.c -o all
> $ ./all
> Composite Score: 570.07
> FFT Mflops: 599.40 (N=1024)
> SOR Mflops: 476.97 (100 x 100)
> MonteCarlo: Mflops: 278.17
> Sparse matmult Mflops: 582.54 (N=1000, nz=5000)
> LU Mflops: 913.27 (M=100, N=100)
> $ gcc -O3 -msse3 -lm all.c -o all
> $ ./all
> Composite Score: 539.20
> FFT Mflops: 516.05 (N=1024)
> SOR Mflops: 472.29 (100 x 100)
> MonteCarlo: Mflops: 167.25
> Sparse matmult Mflops: 633.20 (N=1000, nz=5000)
> LU Mflops: 907.20 (M=100, N=100)
>
> Note that llvm-gcc is achieving almost 280MFLOPS on MonteCarlo here, far
> higher than any competitors, and it is outperforming gcc overall.
Great! Do you see the same results with LTO? Inlining
Random_nextDouble from random.c to MonteCarlo.c should be a big win.
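
As a rough sketch of why the single-file build helps: once the generator is visible in the same translation unit as the sampling loop, the compiler is free to inline it, and the per-sample call (with its return through the fp stack on x86-32) can disappear. Everything below is an assumption for illustration: rng_t, rng_next_double and estimate_pi are hypothetical names, and the simple LCG is not the algorithm used by Random_nextDouble in random.c.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical generator state; not scimark's Random type. */
typedef struct { uint32_t s; } rng_t;

/* Simple 32-bit LCG used only as a stand-in generator.
   static inline makes the body available to the loop below. */
static inline double rng_next_double(rng_t *r)
{
    r->s = r->s * 1664525u + 1013904223u;  /* unsigned math wraps mod 2^32 */
    return (double)r->s / 4294967296.0;    /* scale to [0, 1) */
}

/* Estimate pi by counting random points that fall inside the unit circle. */
double estimate_pi(int num_samples, uint32_t seed)
{
    assert(num_samples > 0);
    rng_t r = { seed };
    int under_curve = 0;
    for (int i = 0; i < num_samples; i++) {
        double x = rng_next_double(&r);
        double y = rng_next_double(&r);
        if (x * x + y * y <= 1.0)
            under_curve++;
    }
    return 4.0 * (double)under_curve / (double)num_samples;
}
```

Whether the call is actually inlined is up to the optimizer, so checking the generated .s is the way to confirm it.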
-Chris