[llvm] [MLGO][Docs] Add documentation on corpus tooling (PR #139362)
Mircea Trofin via llvm-commits
llvm-commits at lists.llvm.org
Sat May 10 08:16:14 PDT 2025
================
@@ -18,8 +18,161 @@ This document is an outline of the tooling that composes MLGO.
Corpus Tooling
==============
-..
- TODO(boomanaiden154): Write this section.
+Within upstream LLVM, there is the ``mlgo-utils`` Python package, which lives at
+``llvm/utils/mlgo-utils``. This package primarily contains tooling for working
+with corpora, or collections of LLVM bitcode. We use these corpora to
+
+.. program:: extract_ir.py
+
+Synopsis
+--------
+
+Extracts a corpus from some form of a structured compilation database. This
+tool supports a variety of different scenarios and input types.
+
+Options
+-------
+
+.. option:: --input
+
+ The path to the input. This should be a path to a supported structured
+ compilation database. Currently only ``compile_commands.json`` files, linker
+ parameter files, a directory containing object files (for the local
+ ThinLTO case only), or a JSON file containing a bazel aquery result are
+ supported.
+
+.. option:: --input_type
+
+ The type of input that has been passed to the ``--input`` flag.
+
+.. option:: --output_dir
+
+ The output directory to place the corpus in.
+
+.. option:: --num_workers
+
+ The number of workers to use for extracting bitcode into the corpus. This
+ defaults to the number of hardware threads available on the host system.
+
+.. option:: --llvm_objcopy_path
+
+ The path to the llvm-objcopy binary to use when extracting bitcode.
+
+.. option:: --obj_base_dir
+
+ The base directory for object files. Bitcode files that get extracted into
+ the corpus will be placed into the output directory based on where their
+ source object files are placed relative to this path.
+
+.. option:: --cmd_filter
+
+ Allows filtering of modules by command line. If set, only modules that match
+ the filter will be extracted into the corpus. Regular expressions are
+ supported in some instances.
+
+.. option:: --thinlto_build
+
+ If the build was performed with ThinLTO, this should be set to either
+ ``distributed`` or ``local`` depending upon how the build was performed.
+
+.. option:: --cmd_section_name
+
+ This flag allows specifying the command line section name. This is needed
+ on non-ELF platforms where the section name might differ.
+
+.. option:: --bitcode_section_name
+
+ This flag allows specifying the bitcode section name. This is needed on
+ non-ELF platforms where the section name might differ.
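To make the option list above concrete, here is a minimal sketch of how an invocation could be assembled from a script. It uses only flags named in the patch. The helper name ``build_extract_ir_command``, its parameters, and the default script path are hypothetical and exist only for illustration; they are not part of ``mlgo-utils``.

```python
import shlex


def build_extract_ir_command(input_path, input_type, output_dir,
                             thinlto_build=None, num_workers=None,
                             script="./extract_ir.py"):
    """Assemble an argv list for extract_ir.py (hypothetical helper).

    Only flags described in the option list are emitted; optional flags
    are left out entirely when no value is given.
    """
    argv = [
        "python3",
        script,
        f"--input={input_path}",
        f"--input_type={input_type}",
        f"--output_dir={output_dir}",
    ]
    if thinlto_build is not None:
        # The documentation names exactly these two ThinLTO modes.
        if thinlto_build not in ("distributed", "local"):
            raise ValueError(f"unsupported ThinLTO mode: {thinlto_build!r}")
        argv.append(f"--thinlto_build={thinlto_build}")
    if num_workers is not None:
        argv.append(f"--num_workers={num_workers}")
    return argv


# Assumed example paths, mirroring the CMake scenario in the patch.
cmd = build_extract_ir_command(
    "./build/compile_commands.json", "json", "./corpus", num_workers=4
)
print(shlex.join(cmd))
```

Building the command as a list rather than a single string avoids quoting problems if a path contains spaces; the list can be passed directly to ``subprocess.run``.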
+
+Example: CMake
+--------------
+
+CMake can output a ``compile_commands.json`` compilation database if the
+``CMAKE_EXPORT_COMPILE_COMMANDS`` switch is turned on at configure time. Assuming
+it was specified and there is a ``compile_commands.json`` file within the
+``./build`` directory, you can run the following command to create a corpus:
+
+.. code-block:: bash
+
+ python3 ./extract_ir.py \
+ --input=./build/compile_commands.json \
+ --input_type=json \
+ --output_dir=./corpus
+
+This assumes that the compilation was performed with bitcode embedding
----------------
mtrofin wrote:
Maybe start with this and show how to do this for a clang build?
https://github.com/llvm/llvm-project/pull/139362