[llvm-dev] RFC: Pass to prune redundant profiling instrumentation
Mehdi Amini via llvm-dev
llvm-dev at lists.llvm.org
Thu Mar 10 19:48:04 PST 2016
> On Mar 10, 2016, at 7:21 PM, Vedant Kumar via llvm-dev <llvm-dev at lists.llvm.org> wrote:
>
> Hi,
>
> I'd like to add a new pass to LLVM which removes redundant profile counter
> updates. The goal is to speed up code coverage testing and profile generation
> for PGO.
>
> I'm sending this email out to describe my approach, share some early results,
> and gather feedback.
>
>
> Problem Overview
> ================
>
> A profile counter is redundant if it's incremented in exactly the same basic
> blocks as some other profile counter. Consider the following module:
>
> local void f1() {
>   instrprof_increment(profc_f1);
> }
>
> void f2() {
>   instrprof_increment(profc_f2);
>   f1();
> }
>
> Once the inliner runs and deletes f1, we're left with:
>
> void f2() {
>   instrprof_increment(profc_f2);
>   instrprof_increment(profc_f1);
> }
>
> Now we can say profc_f1 is redundant (or, equivalently, an alias for profc_f2).
>
> I've noticed that frontend-based instrumentation can generate many redundant
> profile counters. This degrades performance and increases code size. We can
> address the problem by getting rid of redundant counter updates. The trick is
> to make sure we get back the same profiles.
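To make the redundancy criterion concrete: two counters are redundant duplicates of one another exactly when their sets of update sites are identical. Not part of the RFC itself, but here is a minimal standalone C++ sketch of that grouping step (the counter and block names are hypothetical stand-ins for what would really be llvm::GlobalVariable and llvm::BasicBlock pointers):

```cpp
#include <cassert>
#include <map>
#include <set>
#include <string>
#include <utility>
#include <vector>

// Hypothetical model: each counter maps to the set of basic blocks
// that increment it. Identical sets mean identical execution counts.
using BlockSet = std::set<std::string>;

// Returns (redundant counter, canonical counter it aliases) pairs.
// The first counter seen with a given update-site set becomes canonical.
std::vector<std::pair<std::string, std::string>>
findRedundant(const std::map<std::string, BlockSet> &UpdateSites) {
  std::map<BlockSet, std::string> Canonical;
  std::vector<std::pair<std::string, std::string>> Aliases;
  for (const auto &KV : UpdateSites) {
    auto It = Canonical.find(KV.second);
    if (It == Canonical.end())
      Canonical[KV.second] = KV.first; // first counter with this set wins
    else
      Aliases.push_back({KV.first, It->second});
  }
  return Aliases;
}
```

In the inlined example above, profc_f1 and profc_f2 are both updated only in f2's entry block, so one of them becomes an alias for the other.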
>
>
> Proposed Solution
> =================
>
> I propose a pruning pass which takes the following steps:
>
> 1. Delete functions with local linkage and only one use, if that use is in
> a profile data record.
>
> These functions are left around by the inliner (it doesn't know that
> they're safe to delete). Deleting them reduces code size and simplifies
> subsequent analysis of profile counters.
I think this is something GlobalDCE should be taught about (if it doesn't know already).
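For illustration only, the deletability test in step 1 can be sketched with a toy model (in real IR this would inspect llvm::Function's linkage and use list; the struct fields here are hypothetical):

```cpp
#include <cassert>
#include <string>

// Toy stand-in for an IR function and its use information.
struct ToyFunction {
  std::string Name;
  bool HasLocalLinkage;
  int NumUses;           // total remaining uses after inlining
  int UsesInProfileData; // uses coming from __profd_* records
};

// A function is safely deletable when it has local linkage and its only
// remaining use is the reference held by its profile data record.
bool isProfileOnlyDeadFunction(const ToyFunction &F) {
  return F.HasLocalLinkage && F.NumUses == 1 && F.UsesInProfileData == 1;
}
```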
>
> 2. Determine which profile counters are essential.
What is an "essential counter"?
>
> 3. Erase all updates to redundant profile counters.
When you say "erase", you need to actually replace the multiple counter increments with *a new* counter, right? I.e. when replacing:

  instrprof_increment(profc_f2);
  instrprof_increment(profc_f1);

you need to emit:

  instrprof_increment(profc_f2f1_fused);
> 4. Emit the aliases into a new section in the binary.
>
> Aliases are represented as { Dst: i64*, Src: i64* } tuples. Some changes
> in compiler-rt are required to walk the alias section and fill in the
> correct execution counts at program exit time.
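To make the runtime side concrete, a minimal sketch of what such a compiler-rt exit hook could look like (the struct and function names are hypothetical, not the actual compiler-rt API):

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical layout of one entry in the proposed alias section.
struct CounterAlias {
  uint64_t *Dst; // pruned (redundant) counter
  uint64_t *Src; // surviving counter it aliases
};

// At program exit, walk the alias table and materialize a count for
// every pruned counter, so the profile reader still sees correct data.
void resolveAliases(CounterAlias *Begin, CounterAlias *End) {
  for (CounterAlias *A = Begin; A != End; ++A)
    *A->Dst = *A->Src;
}
```

With this scheme the pruned counter's update disappears from the hot path, but its slot in the profile data is still filled in once at exit.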
>
> This pass needs to be run after the inliner in order to be effective.
>
> The complexity of this pass is O(N*M), where N is the number of profile
> counters, and M is the average number of updates per counter.
What you're describing here is not clear to me, but I may be totally misunderstanding the whole concept :)
--
Mehdi
> In practice it is
> a bit faster, since we can skip the analysis of counters which are discovered to
> be redundant early on in the process.
>
>
> Early Results
> =============
>
> The pruning pass results in a 25% speed improvement in the example program above
> (where f2 is called in a loop 10^8 times).
>
> Here is a slightly less contrived example:
>
> #include <vector>
> #include <algorithm>
> #include <cstdlib>
>
> static void escape(void *p) {
>   asm volatile("" : : "g"(p) : "memory");
> }
>
> int main(int argc, char **argv) {
>   std::vector<int> V(atoi(argv[1]));
>   escape(reinterpret_cast<void *>(V.data()));
>   std::sort(V.begin(), V.end());
>   return V[0];
> }
>
> I get the following results on my desktop (10^8 elements, 5 runs each):
>
> O3: 0.262s
> O3 + PGOInstr: 0.663s
> O3 + PGOInstr + Pruning: 0.606s (8.6% performance win, 672 aliases)
> O3 + CoverageInstr: 0.690s
> O3 + CoverageInstr + Pruning: 0.610s (11.6% performance win, 688 aliases)
>
>
> Next Steps?
> ===========
>
> Is the performance of instrumented code something we think we need to fix?
> What's an acceptable compile-time overhead for running this pruning pass? Is
> the general approach a non-starter for anybody?
>
> I'd like to get some feedback and gauge interest before pursuing this further.
> Possible next steps include benchmarking instrumented versions of clang and
> swift on the relevant test suites, running performance tests from lnt, running
> compile-time tests, and measuring any code size differences.
>
>
> thanks
> vedant
>
> _______________________________________________
> LLVM Developers mailing list
> llvm-dev at lists.llvm.org
> http://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev