[llvm-dev] AutoFDO sample profiles v. SelectInst

David Callahan via llvm-dev llvm-dev at lists.llvm.org
Fri Aug 12 10:06:08 PDT 2016


I am looking for advice on a problem observed with
-fprofile-sample-use for samples built with the AutoFDO tool.

I took the "hmmer" benchmark out of SPEC2006.
It is initially compiled with

   clang++ -o hmmer -O3 -std=gnu89 -DSPEC_CPU -DNDEBUG -fno-strict-aliasing -w -g *.c

This baseline binary runs in about 164.2 seconds as reported by "perf stat".

We build a sample file from this program using the AutoFDO tool "create_llvm_prof":

   perf record -b hmmer nph3.hmm swiss41wa
   create_llvm_prof -out hmmer.llvm ...

and rebuild the binary using this profile:

   clang++ -o hmmer-fdo -fprofile-sample-use=hmmer.llvm \
           -O3 -std=gnu89 -DSPEC_CPU -DNDEBUG -fno-strict-aliasing -w -g *.c

Now, sadly, this program runs in 231.2 seconds.

The problem is that when a short conditional block is converted to a
SelectInst, we are unable to accurately recover the branch frequencies,
since the binary no longer contains an actual branch to sample. When we
then compile with the sample profile, the "CodeGen Prepare" phase
examines the profile data and undoes the select conversion, with
disastrous results.
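
To make the transformation concrete, here is a minimal sketch in C. The
loop and both function names are hypothetical and are not taken from
hmmer. The first function has the short conditional block as written in
source; the second is roughly the shape the optimizer may give it after
select conversion, where the comparison picks a value instead of steering
control flow. Whether a particular compiler version actually converts
this loop depends on its cost model. Both functions assume n >= 1.

```c
#include <stddef.h>

/* Hypothetical example, not from hmmer: running maximum of an array.
   Source form: a short conditional block guarding one cheap assignment.
   Built without optimization this is a real conditional branch, so
   branch samples can record how often it is taken. */
int max_branchy(const int *v, size_t n) {
    int best = v[0];
    size_t i;
    for (i = 1; i < n; i++) {
        if (v[i] > best)
            best = v[i];
    }
    return best;
}

/* Roughly the shape after select conversion: the comparison result
   selects between two values, so the loop body contains no branch
   for a sampling profiler to attribute counts to. */
int max_select(const int *v, size_t n) {
    int best = v[0];
    size_t i;
    for (i = 1; i < n; i++) {
        int take = v[i] > best;
        best = take ? v[i] : best;
    }
    return best;
}
```

The two functions return the same result for every input. The only
difference is whether the decision is visible as control flow, which is
the information the sample profile loses.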

If we compile at -O0 for training and then use that profile, which now
has accurate branch weights, the program runs in 149.5
seconds. Unfortunately, of course, the training program runs in 501.4
seconds.

Alternatively, if we disable the original select conversion performed in
SpeculativelyExecuteBB in SimplifyCFG.cpp so the original control flow is
visible to sampling, the training program now runs in 229.7 seconds and
the optimized program runs in 151.5, so we recover essentially all of
the lost information.

Of course both of these options are unfortunate because they alter the
workflow: it would be preferable to monitor production code and feed
the samples back into production builds. That suggests
that we remove the use of profile data in the CodeGen Prepare
phase. When that change is made, and we sample the baseline -O3
binary, the resulting optimized binary runs in 158.9 seconds.

That result is at least slightly better than baseline instead of much
worse, but we are leaving 2-3% on the table. Maybe that is a reasonable
trade-off for having only production builds.
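
For comparing the configurations on a common scale, the timings quoted
in this post can be put in a small table and expressed relative to the
baseline. This is only a sketch: the labels, the struct and the function
names are my own shorthand, and the only data are the wall-clock seconds
given above.

```c
#include <stdio.h>

/* Wall-clock seconds reported in this post; labels are shorthand. */
struct run { const char *label; double seconds; };

static const struct run runs[] = {
    { "baseline -O3",                          164.2 },
    { "sample profile from -O3 binary",        231.2 },
    { "profile from -O0 training binary",      149.5 },
    { "profile with select conversion off",    151.5 },
    { "CodeGen Prepare ignoring profile data", 158.9 },
};

/* Percent change relative to a reference time; negative means faster. */
double percent_change(double seconds, double reference) {
    return 100.0 * (seconds - reference) / reference;
}

/* Print each configuration relative to the first entry (the baseline). */
void print_summary(FILE *out) {
    size_t i;
    for (i = 0; i < sizeof runs / sizeof runs[0]; i++)
        fprintf(out, "%-40s %6.1f s  %+6.1f%%\n", runs[i].label,
                runs[i].seconds,
                percent_change(runs[i].seconds, runs[0].seconds));
}
```

Calling print_summary(stdout) from a main function prints one line per
configuration.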

Any advice or suggestions?
Thanks
david