[cfe-dev] Performance disparity between clang/LLVM and GCC when using libjpeg-turbo

Wed May 15 22:32:10 PDT 2013

Hi.  I maintain libjpeg-turbo, a heavily accelerated fork of libjpeg for 
x86/x86-64 and ARM systems.  A large part of our speedup comes from 
assembly code, but our Huffman codec relies heavily on C compiler 
optimizations to achieve peak performance.  After upgrading to OS X 
10.8, which uses Clang/LLVM as the default compiler rather than GCC, I 
observed a slowdown of 15-20% when compressing images using 
libjpeg-turbo, and it seems to be due to the compiler having trouble 
optimizing said Huffman codec.  I'll walk you through the steps to 
reproduce the issue:

NOTE:  This is probably reproducible on other platforms, such as Linux, 
but I have only tested it on OS X.

Prerequisites:
-- Xcode 4.5.x installed under /Applications/Xcode.app
-- nasm, automake, autoconf, and apple-gcc42 from MacPorts installed 
under /opt/local
-- artificial.ppm from 
http://www.imagecompression.info/test_images/rgb8bit.zip

xcrun svn co svn://svn.code.sf.net/p/libjpeg-turbo/code/trunk libjpeg-turbo
cd libjpeg-turbo
/opt/local/bin/autoreconf -fiv

mkdir osx.64.clang
cd osx.64.clang
sh ../configure --host x86_64-apple-darwin NASM=/opt/local/bin/nasm \
  CC='xcrun clang' CFLAGS=-O4
make
./tjbench {path_to}/artificial.ppm 95 -rgb -quiet
cd ..

mkdir osx.64.llvmgcc
cd osx.64.llvmgcc
sh ../configure --host x86_64-apple-darwin NASM=/opt/local/bin/nasm \
  CC='xcrun gcc' CFLAGS=-O3
make
./tjbench {path_to}/artificial.ppm 95 -rgb -quiet
cd ..

mkdir osx.64.gcc42
cd osx.64.gcc42
sh ../configure --host x86_64-apple-darwin NASM=/opt/local/bin/nasm \
  CC=/opt/local/bin/gcc-apple-4.2 CFLAGS=-O3
make
./tjbench {path_to}/artificial.ppm 95 -rgb -quiet
cd ..

mkdir osx.32.clang
cd osx.32.clang
sh ../configure --host i686-apple-darwin NASM=/opt/local/bin/nasm \
  CC='xcrun clang' CFLAGS='-m32 -O4' LDFLAGS=-m32
make
./tjbench {path_to}/artificial.ppm 95 -rgb -quiet
cd ..

mkdir osx.32.llvmgcc
cd osx.32.llvmgcc
sh ../configure --host i686-apple-darwin NASM=/opt/local/bin/nasm \
  CC='xcrun gcc' CFLAGS='-m32 -O3' LDFLAGS=-m32
make
./tjbench {path_to}/artificial.ppm 95 -rgb -quiet
cd ..

mkdir osx.32.gcc42
cd osx.32.gcc42
sh ../configure --host i686-apple-darwin NASM=/opt/local/bin/nasm \
  CC=/opt/local/bin/gcc-apple-4.2 CFLAGS='-O3 -m32' LDFLAGS=-m32
make
./tjbench {path_to}/artificial.ppm 95 -rgb -quiet
cd ..
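
The six configurations above can also be driven from one script.  This 
is only a rough sketch, not something shipped with libjpeg-turbo: the 
names IMAGE, DRY_RUN, run, and run_config are made up for illustration, 
and it assumes the usual autotools flow (configure, then make) inside 
each build directory.  It is meant to be run from the top of the source 
tree, and by default it only prints the commands it would execute.

```shell
#!/bin/sh
# Sketch: build and benchmark each compiler configuration in its own
# directory.  IMAGE, DRY_RUN, run and run_config are hypothetical names.

IMAGE=${IMAGE:-/path/to/artificial.ppm}   # use an absolute path
DRY_RUN=${DRY_RUN:-1}                     # 1 = print commands only
NASM=/opt/local/bin/nasm

# run CMD...: log the command, and execute it unless DRY_RUN=1
run() {
    echo "+ $*"
    [ "$DRY_RUN" = 1 ] || "$@"
}

# run_config DIR HOST CC CFLAGS [LDFLAGS]
run_config() {
    dir=$1 host=$2 cc=$3 cflags=$4 ldflags=${5:-}
    run mkdir -p "$dir"
    (
        # Work inside the build directory (skipped in dry-run mode,
        # where the directory is never created).
        [ "$DRY_RUN" = 1 ] || cd "$dir" || exit 1
        run sh ../configure --host "$host" NASM="$NASM" \
            CC="$cc" CFLAGS="$cflags" ${ldflags:+LDFLAGS="$ldflags"} &&
        run make &&
        run ./tjbench "$IMAGE" 95 -rgb -quiet
    )
}

GCC42=/opt/local/bin/gcc-apple-4.2

run_config osx.64.clang   x86_64-apple-darwin 'xcrun clang' -O4
run_config osx.64.llvmgcc x86_64-apple-darwin 'xcrun gcc'   -O3
run_config osx.64.gcc42   x86_64-apple-darwin "$GCC42"      -O3
run_config osx.32.clang   i686-apple-darwin 'xcrun clang' '-m32 -O4' -m32
run_config osx.32.llvmgcc i686-apple-darwin 'xcrun gcc'   '-m32 -O3' -m32
run_config osx.32.gcc42   i686-apple-darwin "$GCC42"      '-O3 -m32' -m32
```

Setting DRY_RUN=0 and IMAGE to the absolute path of artificial.ppm runs 
the builds and benchmarks for real.  Each configuration runs in a 
subshell, so the working directory is back at the top of the tree before 
the next one starts.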

A spreadsheet of my results is attached.  Note that decompression 
performance is generally better across the board with Clang/LLVM, but 
compression performance is generally worse.  Note also that with 
llvm-gcc (the GCC front end to LLVM, invoked as 'xcrun gcc' above), 
compression performance falls somewhere between that of Clang and 
GCC 4.2, so it seems that part of the issue may be in Clang and part of 
it may be in LLVM.

If there are things I can do within the inner loops of jchuff.c to make 
it perform better under Clang/LLVM, I am definitely open to that.

DRC
-------------- next part --------------
A non-text attachment was scrubbed...
Name: libjpegturbo-1.3.ods
Type: application/vnd.oasis.opendocument.spreadsheet
Size: 13981 bytes
Desc: not available
URL: <http://lists.llvm.org/pipermail/cfe-dev/attachments/20130516/995bd01a/attachment.ods>