[llvm-dev] RFC: Pass to prune redundant profiling instrumentation
Vedant Kumar via llvm-dev
llvm-dev at lists.llvm.org
Thu Mar 10 19:21:18 PST 2016
Hi,
I'd like to add a new pass to LLVM which removes redundant profile counter
updates. The goal is to speed up code coverage testing and profile generation
for PGO.
I'm sending this email out to describe my approach, share some early results,
and gather feedback.
Problem Overview
================
A profile counter is redundant if it's incremented in exactly the same basic
blocks as some other profile counter. Consider the following module:
static void f1() { // local linkage
  instrprof_increment(profc_f1);
}

void f2() {
  instrprof_increment(profc_f2);
  f1();
}
Once the inliner runs and f1's body is inlined into f2, we're left with:
void f2() {
  instrprof_increment(profc_f2);
  instrprof_increment(profc_f1);
}
Now we can say profc_f1 is redundant (or, an alias for profc_f2).
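To make the aliasing concrete, here is a rough C++-level sketch (illustrative
only, not actual compiler output; the names mirror the example above) of f2
after the redundant update is erased:

#include <cstdint>

// Counters normally emitted by the instrumentation (names illustrative).
std::uint64_t profc_f1 = 0; // redundant: its update has been erased
std::uint64_t profc_f2 = 0; // essential: still updated at run time

void f2() {
  ++profc_f2; // the only remaining counter update
  // ... inlined body of f1, with its instrprof_increment erased ...
}

// At program exit, the runtime recovers f1's count: profc_f1 = profc_f2.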
I've noticed that frontend-based instrumentation can generate many redundant
profile counters. This degrades performance and increases code size. We can
address the problem by getting rid of redundant counter updates. The trick is
to make sure we get back the same profiles.
Proposed Solution
=================
I propose a pruning pass which takes the following steps:
1. Delete functions with local linkage and only one use, if that use is in
   a profile data record.

   These functions are left around by the inliner (it doesn't know that
   they're safe to delete). Deleting them reduces code size and simplifies
   subsequent analysis of profile counters.
2. Determine which profile counters are essential.
3. Erase all updates to redundant profile counters.
4. Emit the aliases into a new section in the binary.

   Aliases are represented as { Dst: i64*, Src: i64* } tuples. Some changes
   in compiler-rt are required to walk the alias section and fill in the
   correct execution counts at program exit time (a sketch of this walk
   follows below).
This pass needs to be run after the inliner in order to be effective.
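To illustrate step 4, here is a sketch of what the runtime-side walk could
look like. The CounterAlias struct, the __prof_alias_begin/__prof_alias_end
symbols, and the function name are placeholders for illustration, not the
actual compiler-rt interface:

#include <cstdint>

// Illustrative alias record matching the { Dst: i64*, Src: i64* } tuples
// described above.
struct CounterAlias {
  std::uint64_t *Dst; // redundant (pruned) counter
  std::uint64_t *Src; // essential counter it aliases
};

// Hypothetical symbols bracketing the alias section; a linker script or
// section attributes would normally provide these.
extern CounterAlias __prof_alias_begin[];
extern CounterAlias __prof_alias_end[];

// At program exit, copy each essential counter's value into the counter it
// replaced, so the emitted profile matches the unpruned one.
void resolveCounterAliases() {
  for (CounterAlias *A = __prof_alias_begin; A != __prof_alias_end; ++A)
    *A->Dst = *A->Src;
}

Running the walk once at exit, before the profile is written out, is what
keeps the emitted profile identical to the one we'd get without pruning.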
The complexity of this pass is O(N*M), where N is the number of profile
counters, and M is the average number of updates per counter. In practice it is
a bit faster, since we can skip the analysis of counters which are discovered to
be redundant early on in the process.
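As a rough sketch of steps 2 and 3 (the real pass works on the IR directly;
CounterID, BlockID, and findAliases here are stand-ins), the analysis amounts
to grouping counters by the set of blocks that update them: the first counter
seen with a given update set is essential, and any later counter with the
same set becomes an alias for it.

#include <map>
#include <set>

using CounterID = unsigned; // stands in for a profile counter variable
using BlockID = unsigned;   // stands in for a basic block

// For each counter, the set of blocks that increment it (after inlining).
using UpdateMap = std::map<CounterID, std::set<BlockID>>;

// Returns a map from each redundant counter to the essential counter it
// aliases. Counters updated in exactly the same blocks collapse into one.
std::map<CounterID, CounterID> findAliases(const UpdateMap &Updates) {
  std::map<std::set<BlockID>, CounterID> Essential; // update set -> representative
  std::map<CounterID, CounterID> Aliases;           // redundant -> essential
  for (const auto &KV : Updates) {
    auto It = Essential.find(KV.second);
    if (It == Essential.end())
      Essential[KV.second] = KV.first; // first counter with this set: essential
    else
      Aliases[KV.first] = It->second;  // same update set: redundant, alias it
  }
  return Aliases;
}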
Early Results
=============
The pruning pass results in a 25% speedup in the example program above
(where f2 is called in a loop 10^8 times).
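For reference, the harness for that measurement is essentially just a loop
driving f2; a minimal driver would look something like this (the real one
may differ):

void f2(); // the instrumented function from the example above

int main() {
  for (long i = 0; i < 100000000; ++i)
    f2();
  return 0;
}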
Here is a slightly less contrived example:
#include <vector>
#include <algorithm>
#include <cstdlib>

// Force the compiler to assume the vector's storage is used, so the work
// can't be optimized away.
static void escape(void *p) {
  asm volatile("" : : "g"(p) : "memory");
}

int main(int argc, char **argv) {
  std::vector<int> V(atoi(argv[1]));
  escape(reinterpret_cast<void *>(V.data()));
  std::sort(V.begin(), V.end());
  return V[0];
}
I get the following results on my desktop (10^8 elements, 5 runs each):
O3:                           0.262s
O3 + PGOInstr:                0.663s
O3 + PGOInstr + Pruning:      0.606s (8.6% performance win, 672 aliases)
O3 + CoverageInstr:           0.690s
O3 + CoverageInstr + Pruning: 0.610s (11.6% performance win, 688 aliases)
Next Steps?
===========
Is the performance of instrumented code something we think we need to fix?
What's an acceptable compile-time overhead for running this pruning pass? Is
the general approach a non-starter for anybody?
I'd like to get some feedback and gauge interest before pursuing this further.
Possible next steps include benchmarking instrumented versions of clang and
swift on the relevant test suites, running performance tests from lnt, running
compile-time tests, and measuring any code size differences.
thanks
vedant