[llvm-dev] AutoFDO sample profiles v. SelectInst

Xinliang David Li via llvm-dev llvm-dev at lists.llvm.org
Fri Aug 12 11:15:32 PDT 2016


There are two potential problems:

1) the branch gets eliminated in the binary that is being profiled, so
there is no profile data
2) the select instruction is lowered into a branch -- but the branch profile
data is not annotated back onto the select instruction.

2) is something that can be improved in SampleFDO.
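
To make the two cases concrete, here is a minimal C sketch of the kind of
short conditional block involved. The function names and the example are
hypothetical and are not taken from hmmer. At -O3 the optimizer may turn the
`if` in max_score() into a select, which is roughly what the hand-written
max_score_select() expresses. If that select is emitted as a conditional
move, the profiled binary has no branch to sample (case 1). If it is lowered
back into a branch, the samples land on a branch that has no annotated
select in the IR (case 2).

```c
#include <assert.h>

/* Hypothetical example: a short, side-effect-free conditional block.
   An optimizer may speculate the assignment and emit a select instead
   of a branch, so the original branch never appears in the binary. */
int max_score(int current, int candidate)
{
    int best = current;
    if (candidate > best)
        best = candidate;
    return best;
}

/* Source-level equivalent of the select form: same result, but no
   explicit control flow for a sampling profiler to observe. */
int max_score_select(int current, int candidate)
{
    return candidate > current ? candidate : current;
}

/* Sanity check that both forms agree on a few inputs. */
void check_equivalence(void)
{
    int inputs[][2] = { {3, 7}, {7, 3}, {5, 5}, {-2, -9} };
    for (unsigned i = 0; i < sizeof inputs / sizeof inputs[0]; ++i)
        assert(max_score(inputs[i][0], inputs[i][1]) ==
               max_score_select(inputs[i][0], inputs[i][1]));
}
```

Whether a given compiler actually forms a select here depends on the
optimization level and target, so inspecting the emitted IR or assembly is
the only way to confirm it for a particular build.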

On Fri, Aug 12, 2016 at 10:06 AM, David Callahan via llvm-dev <
llvm-dev at lists.llvm.org> wrote:

> I am looking for advice on a problem observed with
> -fprofile-sample-use for samples built with the AutoFDO tool
> I took the "hmmer" benchmark out of SPEC2006
> It is initially compiled
>    clang++ -o hmmer -O3 -std=gnu89 -DSPEC_CPU -DNDEBUG
> -fno-strict-aliasing -w -g *.c
> This baseline binary runs in about 164.2 seconds as reported by "perf stat"
> We build a sample file from this program using the AutoFDO tool
> "create_llvm_prof"
>    perf report -b hmmer nph3.hmm swiss41wa

perf record ?

>    create_llvm_prof -out hmmer.llvm ...
> and rebuild the binary using this profile
>    clang++ -o hmmer-fdo -fprofile-sample-use=hmmer.llvm \
>            -O3 -std=gnu89 -DSPEC_CPU -DNDEBUG -fno-strict-aliasing -w -g
> *.c
> now, sadly, this program runs in 231.2 seconds.
> The problem is that when a short conditional block is converted to a
> SelectInst, we are unable to accurately recover the branch frequencies
> since there is no actual branching. When we then compile in the
> presence of the sample, phase "CodeGen Prepare" examines the profile
> data and undoes the select conversion to disastrous results.

This looks like a bug. Is it likely that the SelectInst somehow gets
annotated with bad profile data? Should it make the same decision as when
AutoFDO is not used?

A smaller reproducer would be helpful here.

> If we compile -O0 for training, and then use the profile now with
> accurate branch weights, the program runs in 149.5
> seconds. Unfortunately, of course, the training program runs in 501.4
> seconds.
> Alternately, if we disable the original select conversion performed in
> SpeculativelyExecuteBB in SimplifyCFG.cpp so the original control is
> visible to sampling, the training program now runs in 229.7 seconds and
> the optimized program runs in 151.5, so we recover essentially all of
> the lost information.
> Of course both of these options are unfortunate because they alter the
> workflow where it would be preferable to be able to monitor the
> production codes to feed back into production builds. That suggests
> that we remove the use of profile data in the CodeGen Prepare
> phase. When that change is made, and we sample the baseline -O3
> binary, the resulting optimized binary runs in 158.9 seconds.
> That result is at least slightly better than baseline instead of much
> worse but we are leaving 2-3% on the table. Maybe that is a reasonable
> trade-off for having only production builds.
> Any advice or suggestions?

Please file a bug with a reproducer: preprocessed file, compiler
command line, and profile data in text form.


> Thanks
> david