[llvm] Adding IR2Vec as an analysis pass (PR #134004)

Wed May 14 23:52:08 PDT 2025

================
@@ -174,3 +174,147 @@ clang.
     TODO(mtrofin): 
         - logging, and the use in interactive mode.
         - discuss an example (like the inliner)
+
+IR2Vec Embeddings
+=================
+
+IR2Vec is a program embedding approach designed specifically for LLVM IR. It
+is implemented as a function analysis pass in LLVM. The IR2Vec embeddings
+capture syntactic, semantic, and structural properties of the IR through
+learned representations. These representations are stored in a JSON
+vocabulary that maps the entities of the IR (opcodes, types, operands) to
+n-dimensional floating-point vectors (embeddings).
+
+IR2Vec can produce representations at different granularities of the IR:
+instructions, basic blocks, and functions. Representations of loops and
+regions can be derived from these. The representations are useful for various
+downstream tasks, including ML-guided compiler optimizations.
+
+Currently, using IR2Vec embeddings takes two steps. First, the JSON vocabulary
+is read to obtain the vocabulary mapping. Then, this mapping is used to
+derive the representations. In LLVM, this process is implemented using two
+independent passes: ``IR2VecVocabAnalysis`` and ``IR2VecAnalysis``. The former
+reads the JSON vocabulary and populates ``IR2VecVocabResult``, which is then used
+by ``IR2VecAnalysis``.
+
+``IR2VecVocabAnalysis`` is immutable and is intended to
+be run once before ``IR2VecAnalysis`` is run. In the future, we plan
+to remove this requirement by automatically generating the default vocabulary
+mappings at build time, eliminating the need for a separate file read.
----------------
svkeerthy wrote:

Removed implementation-specific details and kept it simple.
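
For readers of this hunk, the two steps described in the quoted documentation (read the JSON vocabulary, then derive representations from it) can be sketched roughly as below. This is a standalone Python illustration, not the LLVM implementation: the vocabulary entries, the key names (`integerTy`, `variable`), the weights, and the helper functions are all hypothetical, and summing instruction vectors into a function vector is an assumption about how the granularities compose.

```python
import json

# Hypothetical two-dimensional vocabulary: IR entity -> embedding vector.
# A real vocabulary is learned and has many more entries and dimensions.
VOCAB_JSON = """
{
  "add":       [1.0, 2.0],
  "ret":       [0.5, 0.5],
  "integerTy": [0.0, 1.0],
  "variable":  [2.0, 0.0]
}
"""

def load_vocab(text):
    """Step 1: parse the JSON vocabulary into an entity -> vector mapping."""
    vocab = json.loads(text)
    if len({len(vec) for vec in vocab.values()}) != 1:
        raise ValueError("all embeddings must have the same dimension")
    return vocab

def add_scaled(acc, vec, weight):
    """Return acc + weight * vec, element-wise."""
    return [a + weight * x for a, x in zip(acc, vec)]

def embed_instruction(vocab, opcode, type_name, operands,
                      w_opcode=1.0, w_type=0.5, w_operand=0.25):
    """Step 2: combine opcode, type, and operand vectors.

    The weights are illustrative assumptions, not values taken from LLVM.
    """
    dim = len(next(iter(vocab.values())))
    out = [0.0] * dim
    out = add_scaled(out, vocab[opcode], w_opcode)
    out = add_scaled(out, vocab[type_name], w_type)
    for operand in operands:
        out = add_scaled(out, vocab[operand], w_operand)
    return out

def embed_function(vocab, instructions):
    """Coarser granularity: sum the instruction embeddings (assumed scheme)."""
    dim = len(next(iter(vocab.values())))
    total = [0.0] * dim
    for opcode, type_name, operands in instructions:
        total = add_scaled(
            total, embed_instruction(vocab, opcode, type_name, operands), 1.0)
    return total

vocab = load_vocab(VOCAB_JSON)
# A toy function body: "%r = add i32 %a, %b" followed by "ret i32 %r".
toy_function = [
    ("add", "integerTy", ["variable", "variable"]),
    ("ret", "integerTy", ["variable"]),
]
print(embed_function(vocab, toy_function))  # [3.0, 3.5]
```

The vocabulary is loaded once and then only read, which mirrors the split between a one-time vocabulary step and a per-function embedding step.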

https://github.com/llvm/llvm-project/pull/134004