[llvm-dev] AutoFDO sample profiles v. SelectInst,

David Callahan via llvm-dev llvm-dev at lists.llvm.org
Mon Aug 15 16:54:07 PDT 2016

I filed two bugs, which appear different but may be related.

From: Xinliang David Li <xinliangli at gmail.com>
Date: Friday, August 12, 2016 at 11:15 AM
To: David Callahan <dcallahan at fb.com>
Cc: LLVM Dev Mailing list <llvm-dev at lists.llvm.org>, Dehao Chen <dehao at google.com>
Subject: Re: [llvm-dev] AutoFDO sample profiles v. SelectInst,


There are two potential problems:

1) the branch gets eliminated in the binary that is being profiled, so there is no profile data for it
2) the select instruction is lowered into a branch, but the branch profile data is not annotated back onto the select instruction.

2) is something that can be improved in SampleFDO.

On Fri, Aug 12, 2016 at 10:06 AM, David Callahan via llvm-dev <llvm-dev at lists.llvm.org> wrote:

I am looking for advice on a problem observed with
-fprofile-sample-use for samples built with the AutoFDO tool.

I took the "hmmer" benchmark out of SPEC2006.
It is initially compiled with:

   clang++ -o hmmer -O3 -std=gnu89 -DSPEC_CPU -DNDEBUG -fno-strict-aliasing -w -g *.c

This baseline binary runs in about 164.2 seconds as reported by "perf stat".

We build a sample file from this program using the AutoFDO tool "create_llvm_prof":

   perf report -b hmmer nph3.hmm swiss41wa

perf record ?

   create_llvm_prof -out hmmer.llvm ...

and rebuild the binary using this profile:

   clang++ -o hmmer-fdo -fprofile-sample-use=hmmer.llvm \
           -O3 -std=gnu89 -DSPEC_CPU -DNDEBUG -fno-strict-aliasing -w -g *.c

Now, sadly, this program runs in 231.2 seconds.

The problem is that when a short conditional block is converted to a
SelectInst, we are unable to accurately recover the branch frequencies
since there is no actual branching. When we then compile in the
presence of the sample, the "CodeGen Prepare" phase examines the profile
data and undoes the select conversion, with disastrous results.
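To make the pattern concrete, here is a minimal sketch of the kind of short conditional block in question. The function name and arguments are hypothetical and are not taken from the hmmer sources; the loop only illustrates a side-effect-free "if" that an optimizing compiler may turn into a select, leaving no branch in the binary for the sampler to observe.

```c
#include <stddef.h>

/* Hypothetical kernel (not from hmmer): keep a running maximum of
   a[i] + b[i], starting from the caller-supplied value best. */
int max_pair_sum(const int *a, const int *b, size_t n, int best)
{
    size_t i;

    for (i = 0; i < n; i++) {
        int sc = a[i] + b[i];

        /* Short conditional block with no side effects. At -O3 this may
           be rewritten as best = (sc > best) ? sc : best, that is, a
           select rather than a conditional branch. */
        if (sc > best)
            best = sc;
    }
    return best;
}
```

Whether a given compiler version actually emits a select for this loop depends on the target and the optimization pipeline, so inspecting the generated IR or assembly is the only reliable check.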

This looks like a bug here. Is it likely that the SelectInst somehow gets annotated with bad profile data? Should it make the same decision as if AutoFDO is not used?

A smaller reproducer would be helpful here.

If we compile -O0 for training, and then use the profile now with
accurate branch weights, the program runs in 149.5
seconds. Unfortunately, of course, the training program runs in 501.4
seconds.

Alternately, if we disable the original select conversion performed in
SpeculativelyExecuteBB in SimplifyCFG.cpp so the original control flow is
visible to sampling, the training program now runs in 229.7 seconds and
the optimized program runs in 151.5 seconds, so we recover essentially all
of the lost information.

Of course both of these options are unfortunate because they alter the
workflow where it would be preferable to be able to monitor the
production codes to feed back into production builds. That suggests
that we remove the use of profile data in the CodeGen Prepare
phase. When that change is made, and we sample the baseline -O3
binary, the resulting optimized binary runs in 158.9 seconds.

That result is at least slightly better than baseline instead of much
worse, but we are leaving roughly 5-6% on the table (158.9 seconds versus
149.5 to 151.5). Maybe that is a reasonable trade-off for having only
production builds.
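For comparing the run times quoted in this message against the 164.2 second baseline, a small helper like the following may be useful. The function name is hypothetical; it simply expresses a run time as a percentage improvement over the baseline.

```c
/* Percent improvement of a run time over the baseline run time.
   Positive means faster than baseline, negative means slower. */
double improvement_pct(double baseline_sec, double run_sec)
{
    return 100.0 * (baseline_sec - run_sec) / baseline_sec;
}
```

Applied to the numbers above, improvement_pct(164.2, 231.2) is about -40.8, improvement_pct(164.2, 149.5) is about 9.0, improvement_pct(164.2, 151.5) is about 7.7, and improvement_pct(164.2, 158.9) is about 3.2.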

Any advice or suggestions?

Please file a bug with something to reproduce: preprocessed file, compiler command line, and profile data in text form.


