[llvm-dev] [XRay] Build instrumented Clang, some analysis results

Dean Michael Berris via llvm-dev llvm-dev at lists.llvm.org
Wed Jul 20 00:48:02 PDT 2016

Hi everyone,

TL;DR: With current pending patches applied in compiler-rt and llvm, and trunk clang, you can build your application with XRay tracing enabled on Linux with tracing enabled before main starts, and logging stops when the main thread exits.

Just a quick update, I have some patches under review that when applied cleanly to LLVM and compiler-rt allows for building applications with XRay tracing enabled on Linux binaries from start to end. The list of patches to apply are:

  * https://reviews.llvm.org/D21987 (accounting tool, lives in llvm)
  * https://reviews.llvm.org/D21612 then https://reviews.llvm.org/D21982 (compiler-rt XRay runtime implementation)

To test this out, I built clang with XRay and ran it on a hello world application. Here's how I did it:

1) Build clang+compiler-rt with the above patches patched in. Follow the normal instructions for getting a version of clang (from trunk) built. I use CMake+Ninja to do it, and I build everything. Example session:

# assume ./llvm is set up appropriately
$ cd llvm-build
$ cmake -DCMAKE_BUILD_TYPE=Release -G Ninja ../llvm
$ ninja
$ cd ..

2) Build clang again, preferably in another build directory where you can use the clang and compiler-rt built in #1, this time with the following additional flags for CMAKE_CXX_FLAGS: `-fxray-instrument` `-fxray-instruction-threshold=1`. The second flag tells clang+llvm to instrument functions that have at least one instruction in it. An example session of this would look something like:

$ mkdir xray-llvm-build
$ cd xray-llvm-build
$ cmake -DCMAKE_BUILD_TYPE=Release -DCMAKE_CXX_COMPILER=../llvm-build/bin/clang++ -DCMAKE_C_COMPILER=../llvm-build/bin/clang -DCMAKE_CXX_FLAGS='-fxray-instrument -fxray-instruction-threshold=1 -g' -DBUILD_SHARED_LIBS=Off ../llvm
$ ninja clang
$ cd ..

3) Run clang on a sample program, configure XRAY_OPTIONS while doing so for verbosity control. In my case I have the following "Hello, World!" program compiled by clang:

---->8 test.cc 8<----
#include <iostream>

int main(int argc, char* argv[]) {
  std::cout << "I'm being compiled with XRay!" << std::endl;
---->8 test.cc 8<----

$ XRAY_OPTIONS='verbosity=1' xray-llvm-build/bin/clang -o test.bin -O3 -x c++ -stdlib=libstdc++ test.cc

You should then see two lines similar to the following printed:

XRay: Log file in 'xray-log.DqqGQk'

4) Once you have one of those XRay log files, we can then use the 'xray-fc-account' tool built in the llvm-build/bin directory (in step 1). Warning: there's a lot of output in this step, I suggest piping to files. Example session:

$ llvm-build/bin/xray-fc-account xray-log.DqqGQk -m xray-llvm-build/bin/clang -k >sample.txt 2>errs.txt

We can even output it to CSV, which allows us to load the output in a spreadsheet application where we can start sorting and analysing the data as we see fit, by adding the "-format" flag to the above command:

$ llvm-build/bin/xray-fc-account xray-log.DqqGQk -m xray-llvm-build/bin/clang -k -format=csv >sample.csv 2>errs.txt

Analysis Results

Given the above steps, I've done some quick analysis with my favourite spreadsheet application where I'm showing the top results for the Clang run above. Here's some interesting findings:

- The top 20 functions called in clang (in the above invocation) are:

292174	clang::DeclContext::getPrimaryContext()
283681	llvm::FoldingSetNodeID::AddPointer(void const*)
214706	clang::Preprocessor::Lex(clang::Token&)
193352	llvm::FoldingSetNodeID::AddInteger(unsigned int)
161706	clang::Lexer::LexTokenInternal(clang::Token&, bool)
161055	clang::Lexer::Lex(clang::Token&)
143420	clang::DeclarationName::getNameKind() const
142071	llvm::BumpPtrAllocatorImpl<llvm::MallocAllocator, 4096ul, 4096ul>::Allocate(unsigned long, unsigned long)
139009	clang::NamespaceDecl::getOriginalNamespace()
125706	clang::Redeclarable<clang::TagDecl>::DeclLink::getNext(clang::TagDecl const*) const
107928	clang::Type::getAsTagDecl() const
107785	clang::Type::getAsCXXRecordDecl() const
101150	llvm::FoldingSetNodeID::operator==(llvm::FoldingSetNodeID const&) const
96244	llvm::hashing::detail::hash_short(char const*, unsigned long, unsigned long)
94872	_ZN4llvm7hashing6detail23hash_combine_range_implIKjEENSt9enable_ifIXsr16is_hashable_dataIT_EE5valueENS_9hash_codeEE4typeEPS5_S9_
91536	clang::DeclContext::lookup(clang::DeclarationName) const
84122	clang::TypeLoc::getNextTypeLocImpl(clang::TypeLoc)
81368	clang::TypeLoc::getLocalSourceRangeImpl(clang::TypeLoc)
80626	clang::FunctionDecl::getCanonicalDecl()
79489	clang::TagDecl::getDefinition() const
77347	clang::Lexer::LexIdentifier(clang::Token&, char const*)

- The top few slowest (filtered by hand, because, see caveats below):

1.  main
2.  clang::Parser::ParseNamespace(unsigned int, clang::SourceLocation&, clang::SourceLocation)
3.  clang::Parser::ParseInnerNamespace(std::vector<clang::SourceLocation, std::allocator<clang::SourceLocation> >&, std::vector<clang::IdentifierInfo*, std::allocator<clang::IdentifierInfo*> >&, std::vector<clang::SourceLocation, std::allocator<clang::SourceLocation> >&, unsigned int, clang::SourceLocation&, clang::ParsedAttributes&, clang::BalancedDelimiterTracker&)
4.  clang::Parser::ParseTopLevelDecl(clang::OpaquePtr<clang::DeclGroupRef>&)
5.  clang::Parser::ParseDeclaration(unsigned int, clang::SourceLocation&, clang::Parser::ParsedAttributesWithRange&)
6.  clang::Parser::ParseDeclarationStartingWithTemplate(unsigned int, clang::SourceLocation&, clang::AccessSpecifier, clang::AttributeList*)
7.  clang::BackendConsumer::HandleTranslationUnit(clang::ASTContext&)
8.  clang::EmitBackendOutput(clang::DiagnosticsEngine&, clang::CodeGenOptions const&, clang::TargetOptions const&, clang::LangOptions const&, llvm::DataLayout const&, llvm::Module*, clang::BackendAction, llvm::raw_pwrite_stream*)
9.  clang::Parser::ParseExternalDeclaration(clang::Parser::ParsedAttributesWithRange&, clang::ParsingDeclSpec*)
10. clang::Parser::ParseTemplateDeclarationOrSpecialization(unsigned int, clang::SourceLocation&, clang::AccessSpecifier, clang::AttributeList*)
11. clang::Parser::ParseSingleDeclarationAfterTemplate(unsigned int, clang::Parser::ParsedTemplateInfo const&, clang::ParsingDeclRAIIObject&, clang::SourceLocation&, clang::AccessSpecifier, clang::AttributeList*)

- For a single-threaded application, it ran on three "cpus":

CPUID	min tsc	max tsc	duration
5	14,393,810,869,713,800.00	14,393,812,737,589,100.00	0.7181373
12	14,393,816,387,291,400.00	14,393,816,388,054,500.00	0.0002933702
13	14,393,810,074,468,400.00	14,393,810,869,659,100.00	0.305725

- There's a single thread, and we get the full range of timestamps to get the overall duration:

TID	min tsc	max tsc	duration
83637	14,393,810,074,468,400.00	14,393,816,388,054,500.00	2.427369


* Because clang was built with -O3, we're missing some function exits because of tail call and sibling optimisations -- we try to handle those cases from the function call accounting tool instead of adding instrumentation points (yet) in LLVM. Further work needs to be done to identify those exit points as well so that we have less ambiguity in the data.

* In that simple invocation of clang that took roughly around 1 second to complete, we are able to generate roughly 606MB from the naive in-memory logging implementation. That's a lot of data for the fully-instrumented clang version. Removing the '-fxray-instruction-threshold=1' flag defaults this to 200 instructions, and there's considerably less data logged and only from non-trivial functions (~34MB when I tested it).

* The durations are in seconds, computed by taking the maximum CPU frequency. We assume that the CPU TSC counters don't drift/pause (i.e. assume "constant tic" and "nonstop tsc"). This may not be safe assumptions to make, but should be fine for the interim as we iterate on how to get this data more reliably at least in Linux.

* If you add '-v' to the invocation of xray-fc-account, it will give you a stack reconstruction as it goes through every entry/exit event. Last I tried this it took a long time to generate the text, and I gave up when it went over the 1.7GB (of text) mark. It's pretty and hypnotic to watch the stack grow and shrink just from the entry and exits of functions, but it's not practical to do. :)


If you're still reading up to here, thank you for taking the time. I'm happy to iterate on the questions people might have.

Have fun, and I hope this helps!


PS. If you'd like to get the raw trace, I'd be happy to share it but it's 81MB bzip2'd. :) Also, if you'd like to see the csv output or the text output, I'd be happy to share it too but not via email on the mailing list. :D

More information about the llvm-dev mailing list