[llvm-branch-commits] [llvm] [IR2Vec] Adding documentation for llvm-ir2vec tool (PR #148719)
via llvm-branch-commits
llvm-branch-commits at lists.llvm.org
Mon Jul 14 13:48:16 PDT 2025
llvmbot wrote:
<!--LLVM PR SUMMARY COMMENT-->
@llvm/pr-subscribers-llvm-binary-utilities
Author: S. VenkataKeerthy (svkeerthy)
<details>
<summary>Changes</summary>
---
Full diff: https://github.com/llvm/llvm-project/pull/148719.diff
4 Files Affected:
- (modified) llvm/docs/CommandGuide/index.rst (+1)
- (added) llvm/docs/CommandGuide/llvm-ir2vec.rst (+170)
- (modified) llvm/docs/MLGO.rst (+11-1)
- (modified) llvm/tools/llvm-ir2vec/llvm-ir2vec.cpp (+2-6)
``````````diff
diff --git a/llvm/docs/CommandGuide/index.rst b/llvm/docs/CommandGuide/index.rst
index 88fc1fd326b76..f85f32a1fdd51 100644
--- a/llvm/docs/CommandGuide/index.rst
+++ b/llvm/docs/CommandGuide/index.rst
@@ -27,6 +27,7 @@ Basic Commands
llvm-dis
llvm-dwarfdump
llvm-dwarfutil
+ llvm-ir2vec
llvm-lib
llvm-libtool-darwin
llvm-link
diff --git a/llvm/docs/CommandGuide/llvm-ir2vec.rst b/llvm/docs/CommandGuide/llvm-ir2vec.rst
new file mode 100644
index 0000000000000..13fe4996b968f
--- /dev/null
+++ b/llvm/docs/CommandGuide/llvm-ir2vec.rst
@@ -0,0 +1,170 @@
+llvm-ir2vec - IR2Vec Embedding Generation Tool
+==============================================
+
+.. program:: llvm-ir2vec
+
+SYNOPSIS
+--------
+
+:program:`llvm-ir2vec` [*options*] *input-file*
+
+DESCRIPTION
+-----------
+
+:program:`llvm-ir2vec` is a standalone command-line tool for IR2Vec. It
+generates IR2Vec embeddings for LLVM IR and supports triplet generation
+for vocabulary training. It provides two main operation modes:
+
+1. **Triplet Mode**: Generates triplets (opcode, type, operands) for vocabulary
+ training from LLVM IR.
+
+2. **Embedding Mode**: Generates IR2Vec embeddings using a trained vocabulary
+ at different granularity levels (instruction, basic block, or function).
+
+The tool is designed to facilitate machine learning applications that work with
+LLVM IR by converting the IR into numerical representations that can be used by
+ML models.
+
+.. note::
+
+ For information about using IR2Vec programmatically within LLVM passes and
+ the C++ API, see the `IR2Vec Embeddings <https://llvm.org/docs/MLGO.html#ir2vec-embeddings>`_
+ section in the MLGO documentation.
+
+OPERATION MODES
+---------------
+
+Triplet Generation Mode
+~~~~~~~~~~~~~~~~~~~~~~~
+
+In triplet mode, :program:`llvm-ir2vec` analyzes LLVM IR and extracts triplets
+consisting of opcodes, types, and operands. These triplets can be used to train
+vocabularies for embedding generation.
+
+Usage:
+
+.. code-block:: bash
+
+ llvm-ir2vec --mode=triplets input.bc -o triplets.txt
+
+Embedding Generation Mode
+~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+In embedding mode, :program:`llvm-ir2vec` uses a pre-trained vocabulary to
+generate numerical embeddings for LLVM IR at different levels of granularity.
+
+Example Usage:
+
+.. code-block:: bash
+
+ llvm-ir2vec --mode=embeddings --ir2vec-vocab-path=vocab.json --level=func input.bc -o embeddings.txt
+
+OPTIONS
+-------
+
+.. option:: --mode=<mode>
+
+ Specify the operation mode. Valid values are:
+
+ * ``triplets`` - Generate triplets for vocabulary training
+ * ``embeddings`` - Generate embeddings using trained vocabulary (default)
+
+.. option:: --level=<level>
+
+ Specify the embedding generation level. Valid values are:
+
+ * ``inst`` - Generate instruction-level embeddings
+ * ``bb`` - Generate basic block-level embeddings
+ * ``func`` - Generate function-level embeddings (default)
+
+.. option:: --function=<name>
+
+ Process only the specified function instead of all functions in the module.
+
+.. option:: --ir2vec-vocab-path=<path>
+
+ Specify the path to the vocabulary file (required for embedding mode).
+ The vocabulary file should be in JSON format and contain the trained
+ vocabulary for embedding generation. See `llvm/lib/Analysis/models`
+ for pre-trained vocabulary files.
+
+.. option:: --ir2vec-opc-weight=<weight>
+
+ Specify the weight for opcode embeddings (default: 1.0). This controls
+ the relative importance of instruction opcodes in the final embedding.
+
+.. option:: --ir2vec-type-weight=<weight>
+
+ Specify the weight for type embeddings (default: 0.5). This controls
+ the relative importance of type information in the final embedding.
+
+.. option:: --ir2vec-arg-weight=<weight>
+
+ Specify the weight for argument embeddings (default: 0.2). This controls
+ the relative importance of operand information in the final embedding.
+
+.. option:: -o <filename>
+
+ Specify the output filename. Use ``-`` to write to standard output (default).
+
+.. option:: --help
+
+ Print a summary of command line options.
+
+.. note::
+
+ ``--level``, ``--function``, ``--ir2vec-vocab-path``, ``--ir2vec-opc-weight``,
+ ``--ir2vec-type-weight``, and ``--ir2vec-arg-weight`` are only used in embedding
+ mode. These options are ignored in triplet mode.
+
+INPUT FILE FORMAT
+-----------------
+
+:program:`llvm-ir2vec` accepts LLVM bitcode files (``.bc``) and LLVM IR files
+(``.ll``) as input. The input file should contain valid LLVM IR.
+
+OUTPUT FORMAT
+-------------
+
+Triplet Mode Output
+~~~~~~~~~~~~~~~~~~~
+
+In triplet mode, the output consists of lines containing space-separated triplets:
+
+.. code-block:: text
+
+ <opcode> <type> <operand1> <operand2> ...
+
+Each line represents the information of one instruction, with the opcode, type,
+and operands.
+
+Embedding Mode Output
+~~~~~~~~~~~~~~~~~~~~~
+
+In embedding mode, the output format depends on the specified level:
+
+* **Function Level**: One embedding vector per function
+* **Basic Block Level**: One embedding vector per basic block, grouped by function
+* **Instruction Level**: One embedding vector per instruction, grouped by basic block and function
+
+Each embedding is represented as a floating point vector.
+
+EXIT STATUS
+-----------
+
+:program:`llvm-ir2vec` returns 0 on success, and a non-zero value on failure.
+
+Common failure cases include:
+
+* Invalid or missing input file
+* Missing or invalid vocabulary file (in embedding mode)
+* Specified function not found in the module
+* Invalid command line options
+
+SEE ALSO
+--------
+
+:doc:`../MLGO`
+
+For more information about the IR2Vec algorithm and approach, see:
+`IR2Vec: LLVM IR Based Scalable Program Embeddings <https://doi.org/10.1145/3418463>`_.
diff --git a/llvm/docs/MLGO.rst b/llvm/docs/MLGO.rst
index ed0769bebeac3..965a21b8c84b8 100644
--- a/llvm/docs/MLGO.rst
+++ b/llvm/docs/MLGO.rst
@@ -468,6 +468,13 @@ The core components are:
Using IR2Vec
------------
+.. note::
+
+ This section describes how to use IR2Vec within LLVM passes. A standalone
+ tool :doc:`CommandGuide/llvm-ir2vec` is available for generating the
+ embeddings and triplets from LLVM IR files, which can be useful for
+ training vocabularies and generating embeddings outside of compiler passes.
+
For generating embeddings, first the vocabulary should be obtained. Then, the
embeddings can be computed and accessed via an ``ir2vec::Embedder`` instance.
@@ -524,6 +531,10 @@ Further Details
For more detailed information about the IR2Vec algorithm, its parameters, and
advanced usage, please refer to the original paper:
`IR2Vec: LLVM IR Based Scalable Program Embeddings <https://doi.org/10.1145/3418463>`_.
+
+For information about using IR2Vec tool for generating embeddings and
+triplets from LLVM IR, see :doc:`CommandGuide/llvm-ir2vec`.
+
The LLVM source code for ``IR2Vec`` can also be explored to understand the
implementation details.
@@ -595,4 +606,3 @@ optimizations that are currently MLGO-enabled, it may be used as follows:
where the ``name`` is a path fragment. We will expect to find 2 files,
``<name>.in`` (readable, data incoming from the managing process) and
``<name>.out`` (writable, the model runner sends data to the managing process)
-
diff --git a/llvm/tools/llvm-ir2vec/llvm-ir2vec.cpp b/llvm/tools/llvm-ir2vec/llvm-ir2vec.cpp
index eba8c2e5678b1..c9e2c7c713e18 100644
--- a/llvm/tools/llvm-ir2vec/llvm-ir2vec.cpp
+++ b/llvm/tools/llvm-ir2vec/llvm-ir2vec.cpp
@@ -276,12 +276,8 @@ int main(int argc, char **argv) {
"Generates embeddings for a given LLVM IR and "
"supports triplet generation for vocabulary "
"training and embedding generation.\n\n"
- "Usage:\n"
- " Triplet mode: llvm-ir2vec --mode=triplets input.bc\n"
- " Embedding mode: llvm-ir2vec --mode=embeddings "
- "--ir2vec-vocab-path=vocab.json --level=func input.bc\n"
- " Levels: --level=inst (instructions), --level=bb (basic blocks), "
- "--level=func (functions)\n");
+ "See https://llvm.org/docs/CommandGuide/llvm-ir2vec.html for more "
+ "information.\n");
// Validate command line options
if (Mode == TripletMode && Level.getNumOccurrences() > 0)
``````````
</details>
https://github.com/llvm/llvm-project/pull/148719
More information about the llvm-branch-commits
mailing list