[llvm-dev] [XRay] Build instrumented Clang, some analysis results
Dean Michael Berris via llvm-dev
llvm-dev at lists.llvm.org
Wed Jul 20 00:48:02 PDT 2016
Hi everyone,
TL;DR: With current pending patches applied in compiler-rt and llvm, and trunk clang, you can build your application with XRay tracing enabled on Linux with tracing enabled before main starts, and logging stops when the main thread exits.
Just a quick update, I have some patches under review that when applied cleanly to LLVM and compiler-rt allows for building applications with XRay tracing enabled on Linux binaries from start to end. The list of patches to apply are:
* https://reviews.llvm.org/D21987 (accounting tool, lives in llvm)
* https://reviews.llvm.org/D21612 then https://reviews.llvm.org/D21982 (compiler-rt XRay runtime implementation)
To test this out, I built clang with XRay and ran it on a hello world application. Here's how I did it:
1) Build clang+compiler-rt with the above patches patched in. Follow the normal instructions for getting a version of clang (from trunk) built. I use CMake+Ninja to do it, and I build everything. Example session:
# assume ./llvm is set up appropriately
$ cd llvm-build
$ cmake -DCMAKE_BUILD_TYPE=Release -G Ninja ../llvm
$ ninja
$ cd ..
2) Build clang again, preferably in another build directory where you can use the clang and compiler-rt built in #1, this time with the following additional flags for CMAKE_CXX_FLAGS: `-fxray-instrument` `-fxray-instruction-threshold=1`. The second flag tells clang+llvm to instrument functions that have at least one instruction in it. An example session of this would look something like:
$ mkdir xray-llvm-build
$ cd xray-llvm-build
$ cmake -DCMAKE_BUILD_TYPE=Release -DCMAKE_CXX_COMPILER=../llvm-build/bin/clang++ -DCMAKE_C_COMPILER=../llvm-build/bin/clang -DCMAKE_CXX_FLAGS='-fxray-instrument -fxray-instruction-threshold=1 -g' -DBUILD_SHARED_LIBS=Off ../llvm
$ ninja clang
$ cd ..
3) Run clang on a sample program, configure XRAY_OPTIONS while doing so for verbosity control. In my case I have the following "Hello, World!" program compiled by clang:
---->8 test.cc 8<----
#include <iostream>
int main(int argc, char* argv[]) {
std::cout << "I'm being compiled with XRay!" << std::endl;
}
---->8 test.cc 8<----
$ XRAY_OPTIONS='verbosity=1' xray-llvm-build/bin/clang -o test.bin -O3 -x c++ -stdlib=libstdc++ test.cc
You should then see two lines similar to the following printed:
XRay: Log file in 'xray-log.DqqGQk'
4) Once you have one of those XRay log files, we can then use the 'xray-fc-account' tool built in the llvm-build/bin directory (in step 1). Warning: there's a lot of output in this step, I suggest piping to files. Example session:
$ llvm-build/bin/xray-fc-account xray-log.DqqGQk -m xray-llvm-build/bin/clang -k >sample.txt 2>errs.txt
We can even output it to CSV, which allows us to load the output in a spreadsheet application where we can start sorting and analysing the data as we see fit, by adding the "-format" flag to the above command:
$ llvm-build/bin/xray-fc-account xray-log.DqqGQk -m xray-llvm-build/bin/clang -k -format=csv >sample.csv 2>errs.txt
Analysis Results
====
Given the above steps, I've done some quick analysis with my favourite spreadsheet application where I'm showing the top results for the Clang run above. Here's some interesting findings:
- The top 20 functions called in clang (in the above invocation) are:
292174 clang::DeclContext::getPrimaryContext()
283681 llvm::FoldingSetNodeID::AddPointer(void const*)
214706 clang::Preprocessor::Lex(clang::Token&)
193352 llvm::FoldingSetNodeID::AddInteger(unsigned int)
161706 clang::Lexer::LexTokenInternal(clang::Token&, bool)
161055 clang::Lexer::Lex(clang::Token&)
143420 clang::DeclarationName::getNameKind() const
142071 llvm::BumpPtrAllocatorImpl<llvm::MallocAllocator, 4096ul, 4096ul>::Allocate(unsigned long, unsigned long)
139009 clang::NamespaceDecl::getOriginalNamespace()
125706 clang::Redeclarable<clang::TagDecl>::DeclLink::getNext(clang::TagDecl const*) const
107928 clang::Type::getAsTagDecl() const
107785 clang::Type::getAsCXXRecordDecl() const
101150 llvm::FoldingSetNodeID::operator==(llvm::FoldingSetNodeID const&) const
96244 llvm::hashing::detail::hash_short(char const*, unsigned long, unsigned long)
94872 _ZN4llvm7hashing6detail23hash_combine_range_implIKjEENSt9enable_ifIXsr16is_hashable_dataIT_EE5valueENS_9hash_codeEE4typeEPS5_S9_
91536 clang::DeclContext::lookup(clang::DeclarationName) const
84122 clang::TypeLoc::getNextTypeLocImpl(clang::TypeLoc)
81368 clang::TypeLoc::getLocalSourceRangeImpl(clang::TypeLoc)
80626 clang::FunctionDecl::getCanonicalDecl()
79489 clang::TagDecl::getDefinition() const
77347 clang::Lexer::LexIdentifier(clang::Token&, char const*)
- The top few slowest (filtered by hand, because, see caveats below):
1. main
2. clang::Parser::ParseNamespace(unsigned int, clang::SourceLocation&, clang::SourceLocation)
3. clang::Parser::ParseInnerNamespace(std::vector<clang::SourceLocation, std::allocator<clang::SourceLocation> >&, std::vector<clang::IdentifierInfo*, std::allocator<clang::IdentifierInfo*> >&, std::vector<clang::SourceLocation, std::allocator<clang::SourceLocation> >&, unsigned int, clang::SourceLocation&, clang::ParsedAttributes&, clang::BalancedDelimiterTracker&)
4. clang::Parser::ParseTopLevelDecl(clang::OpaquePtr<clang::DeclGroupRef>&)
5. clang::Parser::ParseDeclaration(unsigned int, clang::SourceLocation&, clang::Parser::ParsedAttributesWithRange&)
6. clang::Parser::ParseDeclarationStartingWithTemplate(unsigned int, clang::SourceLocation&, clang::AccessSpecifier, clang::AttributeList*)
7. clang::BackendConsumer::HandleTranslationUnit(clang::ASTContext&)
8. clang::EmitBackendOutput(clang::DiagnosticsEngine&, clang::CodeGenOptions const&, clang::TargetOptions const&, clang::LangOptions const&, llvm::DataLayout const&, llvm::Module*, clang::BackendAction, llvm::raw_pwrite_stream*)
9. clang::Parser::ParseExternalDeclaration(clang::Parser::ParsedAttributesWithRange&, clang::ParsingDeclSpec*)
10. clang::Parser::ParseTemplateDeclarationOrSpecialization(unsigned int, clang::SourceLocation&, clang::AccessSpecifier, clang::AttributeList*)
11. clang::Parser::ParseSingleDeclarationAfterTemplate(unsigned int, clang::Parser::ParsedTemplateInfo const&, clang::ParsingDeclRAIIObject&, clang::SourceLocation&, clang::AccessSpecifier, clang::AttributeList*)
- For a single-threaded application, it ran on three "cpus":
CPUID min tsc max tsc duration
5 14,393,810,869,713,800.00 14,393,812,737,589,100.00 0.7181373
12 14,393,816,387,291,400.00 14,393,816,388,054,500.00 0.0002933702
13 14,393,810,074,468,400.00 14,393,810,869,659,100.00 0.305725
- There's a single thread, and we get the full range of timestamps to get the overall duration:
TID min tsc max tsc duration
83637 14,393,810,074,468,400.00 14,393,816,388,054,500.00 2.427369
Caveats
====
* Because clang was built with -O3, we're missing some function exits because of tail call and sibling optimisations -- we try to handle those cases from the function call accounting tool instead of adding instrumentation points (yet) in LLVM. Further work needs to be done to identify those exit points as well so that we have less ambiguity in the data.
* In that simple invocation of clang that took roughly around 1 second to complete, we are able to generate roughly 606MB from the naive in-memory logging implementation. That's a lot of data for the fully-instrumented clang version. Removing the '-fxray-instruction-threshold=1' flag defaults this to 200 instructions, and there's considerably less data logged and only from non-trivial functions (~34MB when I tested it).
* The durations are in seconds, computed by taking the maximum CPU frequency. We assume that the CPU TSC counters don't drift/pause (i.e. assume "constant tic" and "nonstop tsc"). This may not be safe assumptions to make, but should be fine for the interim as we iterate on how to get this data more reliably at least in Linux.
* If you add '-v' to the invocation of xray-fc-account, it will give you a stack reconstruction as it goes through every entry/exit event. Last I tried this it took a long time to generate the text, and I gave up when it went over the 1.7GB (of text) mark. It's pretty and hypnotic to watch the stack grow and shrink just from the entry and exits of functions, but it's not practical to do. :)
----
If you're still reading up to here, thank you for taking the time. I'm happy to iterate on the questions people might have.
Have fun, and I hope this helps!
Cheers
PS. If you'd like to get the raw trace, I'd be happy to share it but it's 81MB bzip2'd. :) Also, if you'd like to see the csv output or the text output, I'd be happy to share it too but not via email on the mailing list. :D
More information about the llvm-dev
mailing list