[PATCH] D101229: [InlineCost] Bump threshold for inlining cold callsites (PR50099)
Xinliang David Li via llvm-commits
llvm-commits at lists.llvm.org
Sat Apr 24 11:28:25 PDT 2021
Changing the default threshold needs lots of benchmarking.
For this particular case, IMO the better way is to enhance the inline cost
analysis to give the callsite more of a bonus if it enables SROA in the call
context. The analysis needs to be careful: if there is another callsite
that blocks SROA, and that callsite can never be inlined, then the bonus
cannot be applied.
David
On Sat, Apr 24, 2021 at 4:10 AM Roman Lebedev via Phabricator <
reviews at reviews.llvm.org> wrote:
> lebedev.ri created this revision.
> lebedev.ri added reviewers: aeubanks, eraman, Prazek, davidxl.
> lebedev.ri added a project: LLVM.
> Herald added subscribers: haicheng, hiraditya.
> lebedev.ri requested review of this revision.
>
> I'm observing a rather big runtime performance regression
> as a result of NewPM switch on one of RawSpeed's benchmarks:
>
> raw.pixls.us-unique/Panasonic/DC-GH5S$
> /repositories/googlebenchmark/tools/compare.py -a benchmarks ~/rawspeed/build-{old,new}/src/utilities/rsbench/rsbench --benchmark_counters_tabular=true P1022085.RW2 --benchmark_repetitions=9 --benchmark_min_time=1
> RUNNING: /home/lebedevri/rawspeed/build-old/src/utilities/rsbench/rsbench --benchmark_counters_tabular=true P1022085.RW2 --benchmark_repetitions=9 --benchmark_min_time=1 --benchmark_display_aggregates_only=true --benchmark_out=/tmp/tmpk9pkqe2s
> 2021-04-24T14:00:17+03:00
> Running /home/lebedevri/rawspeed/build-old/src/utilities/rsbench/rsbench
> Run on (32 X 3599.99 MHz CPU s)
> CPU Caches:
> L1 Data 32 KiB (x16)
> L1 Instruction 32 KiB (x16)
> L2 Unified 512 KiB (x16)
> L3 Unified 32768 KiB (x2)
> Load Average: 0.65, 0.51, 1.27
>
> ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
> Benchmark  Time  CPU  Iterations  CPUTime,s  CPUTime/WallTime  Pixels  Pixels/CPUTime  Pixels/WallTime  Raws/CPUTime  Raws/WallTime  WallTime,s
>
> ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
> P1022085.RW2/threads:32/process_time/real_time_mean  0.748 ms  23.9 ms  9  0.0239231  31.9721  10.3933M  434.452M  13.8904G  41.801  1.33647k  748.25u
> P1022085.RW2/threads:32/process_time/real_time_median  0.748 ms  23.9 ms  9  0.0239156  31.9716  10.3933M  434.585M  13.8934G  41.8138  1.33676k  748.079u
> P1022085.RW2/threads:32/process_time/real_time_stddev  0.003 ms  0.080 ms  9  80.0846u  6.00073m  0  1.45335M  48.9162M  0.139834  4.7065  2.63684u
> RUNNING: /home/lebedevri/rawspeed/build-new/src/utilities/rsbench/rsbench --benchmark_counters_tabular=true P1022085.RW2 --benchmark_repetitions=9 --benchmark_min_time=1 --benchmark_display_aggregates_only=true --benchmark_out=/tmp/tmpt6ijfryg
> 2021-04-24T14:00:31+03:00
> Running /home/lebedevri/rawspeed/build-new/src/utilities/rsbench/rsbench
> Run on (32 X 3600.05 MHz CPU s)
> CPU Caches:
> L1 Data 32 KiB (x16)
> L1 Instruction 32 KiB (x16)
> L2 Unified 512 KiB (x16)
> L3 Unified 32768 KiB (x2)
> Load Average: 5.54, 1.56, 1.61
>
> ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
> Benchmark  Time  CPU  Iterations  CPUTime,s  CPUTime/WallTime  Pixels  Pixels/CPUTime  Pixels/WallTime  Raws/CPUTime  Raws/WallTime  WallTime,s
>
> ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
> P1022085.RW2/threads:32/process_time/real_time_mean  0.851 ms  27.2 ms  9  0.0272077  31.9615  10.3933M  382.027M  12.2102G  36.7569  1.17481k  851.271u
> P1022085.RW2/threads:32/process_time/real_time_median  0.848 ms  27.1 ms  9  0.0271017  31.9699  10.3933M  383.494M  12.2598G  36.8981  1.17959k  847.755u
> P1022085.RW2/threads:32/process_time/real_time_stddev  0.008 ms  0.243 ms  9  243.106u  0.0215795  0  3.38806M  116.08M  0.325984  11.1687  8.16022u
> Comparing /home/lebedevri/rawspeed/build-old/src/utilities/rsbench/rsbench to /home/lebedevri/rawspeed/build-new/src/utilities/rsbench/rsbench
> Benchmark  Time  CPU  Time Old  Time New  CPU Old  CPU New
>
> ----------------------------------------------------------------------------------------------------------------------------------------------------
> P1022085.RW2/threads:32/process_time/real_time_pvalue  0.0004  0.0004  U Test, Repetitions: 9 vs 9
> P1022085.RW2/threads:32/process_time/real_time_mean  +0.1377  +0.1373  1  1  24  27
> P1022085.RW2/threads:32/process_time/real_time_median  +0.1332  +0.1333  1  1  24  27
> P1022085.RW2/threads:32/process_time/real_time_stddev  +2.0947  +2.0384  0  0  0  0
>
> I've posted repro IR at https://bugs.llvm.org/show_bug.cgi?id=50099.
> It happens because a certain destructor, which runs at the end of a certain
> function,
> is no longer inlined, thus preventing SROA of the class.
>
> I guess this bisects to D28331 <https://reviews.llvm.org/D28331>, which
> added the lower threshold for cold calls,
> with this comment:
>
> In D28331 <https://reviews.llvm.org/D28331>, @eraman wrote:
>
> > I've tuned the thresholds for the hot and cold callsites using a hacked
> up version of the old inliner that explicitly computes BFI on a set of
> internal benchmarks and spec. Once the new PM based pipeline stabilizes
> (IIRC Chandler mentioned there are known issues) I'll benchmark this again
> and adjust the thresholds if required.
>
> Since the values haven't been changed since then, I guess that hasn't
> happened yet.
>
> Analyzing the problem, on top of D101228 <https://reviews.llvm.org/D101228>,
> it would cost `50` to inline it, while the threshold is `45`.
> Bumping the threshold to `50` isn't sufficient, because the threshold is
> non-inclusive (is that intentional?),
> so I rounded it up to `55`.
>
> This addresses the issue: inlining happens, SROA happens, and the perf is
> happy.
>
> While this seems like the least problematic approach,
> I think we may want to somehow boost inlining of callees
> whose arguments are marked as likely to be SROA-able.
> Perhaps, at the least, we shouldn't apply this budget-lowering logic in such
> cases?
>
>
> Repository:
> rG LLVM Github Monorepo
>
> https://reviews.llvm.org/D101229
>
> Files:
> llvm/lib/Analysis/InlineCost.cpp
> llvm/test/Transforms/Inline/X86/inline-cold-callsite.ll
>
>
> Index: llvm/test/Transforms/Inline/X86/inline-cold-callsite.ll
> ===================================================================
> --- /dev/null
> +++ llvm/test/Transforms/Inline/X86/inline-cold-callsite.ll
> @@ -0,0 +1,37 @@
> +; RUN: opt < %s -inline -debug-only=inline-cost -print-instruction-comments -S -mtriple=x86_64-unknown-linux-gnu 2>&1 | FileCheck %s
> +; RUN: opt < %s -passes='cgscc(inline)' -debug-only=inline-cost -print-instruction-comments -S -mtriple=x86_64-unknown-linux-gnu 2>&1 | FileCheck %s
> +
> +; REQUIRES: asserts
> +
> +; Check the threshold for inlining cold callsites.
> +
> +; CHECK: Analyzing call of cold_callee... (caller:caller)
> +; CHECK-NEXT: Cold callsite
> +; CHECK-NEXT: define void @cold_callee() {
> +; CHECK-NEXT: ; cost before = -30, cost after = -30, threshold before = 51, threshold after = 51, cost delta = 0
> +; CHECK-NEXT: ret void
> +; CHECK-NEXT: }
> +
> +declare void @hot_callee()
> +
> +define void @cold_callee() {
> + ret void
> +}
> +
> +define void @caller(i1 %c) {
> +entry:
> + br i1 %c, label %hot, label %cold, !prof !0
> +
> +hot:
> + call void @hot_callee()
> + br label %end
> +
> +cold:
> + call void @cold_callee()
> + br label %end
> +
> +end:
> + ret void
> +}
> +
> +!0 = !{!"branch_weights", i32 1000000, i32 1}
> Index: llvm/lib/Analysis/InlineCost.cpp
> ===================================================================
> --- llvm/lib/Analysis/InlineCost.cpp
> +++ llvm/lib/Analysis/InlineCost.cpp
> @@ -68,7 +68,7 @@
>
> static cl::opt<int>
> ColdCallSiteThreshold("inline-cold-callsite-threshold", cl::Hidden,
> - cl::init(45), cl::ZeroOrMore,
> + cl::init(55), cl::ZeroOrMore,
> cl::desc("Threshold for inlining cold callsites"));
>
> static cl::opt<bool> InlineEnableCostBenefitAnalysis(
> @@ -89,7 +89,7 @@
> // PGO before we actually hook up inliner with analysis passes such as BPI and
> // BFI.
> static cl::opt<int> ColdThreshold(
> - "inlinecold-threshold", cl::Hidden, cl::init(45), cl::ZeroOrMore,
> + "inlinecold-threshold", cl::Hidden, cl::init(55), cl::ZeroOrMore,
> cl::desc("Threshold for inlining functions with cold attribute"));
>
> static cl::opt<int>
>
>
>