[llvm] [IR2Vec] Refactor vocabulary to use canonical type IDs (PR #155323)
S. VenkataKeerthy via llvm-commits
llvm-commits at lists.llvm.org
Fri Aug 29 13:10:08 PDT 2025
================
@@ -137,13 +138,48 @@ using InstEmbeddingsMap = DenseMap<const Instruction *, Embedding>;
using BBEmbeddingsMap = DenseMap<const BasicBlock *, Embedding>;
/// Class for storing and accessing the IR2Vec vocabulary.
-/// Encapsulates all vocabulary-related constants, logic, and access methods.
+///
+/// The Vocabulary class manages seed embeddings for LLVM IR entities. It
+/// contains the seed embeddings for three types of entities: instruction
+/// opcodes, types, and operands. Types are grouped/canonicalized for better
+/// learning (e.g., all float variants map to FloatTy). The vocabulary abstracts
+/// away the canonicalization effectively, the exposed APIs handle all the known
+/// LLVM IR opcodes, types and operands.
+///
+/// This class helps populate the seed embeddings in an internal vector-based
+/// ADT. It provides logic to map every IR entity to a specific slot index or
+/// position in this vector, enabling O(1) embedding lookup while avoiding
+/// unnecessary computations involving string based lookups while generating the
+/// embeddings.
class Vocabulary {
friend class llvm::IR2VecVocabAnalysis;
using VocabVector = std::vector<ir2vec::Embedding>;
VocabVector Vocab;
bool Valid = false;
+public:
+ // Slot layout:
----------------
svkeerthy wrote:
Yes, we can do this refactoring. However, the tool needs consecutive indexing of entities when dumping the triplets, so this logic should either be in ir2vec::Vocabulary or be moved to the tool.
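To illustrate what consecutive indexing means here, below is a minimal sketch, assuming a simplified layout in which opcode slots come first, followed by canonical type slots, then operand-kind slots. All names and counts (RawTypeID, CanonicalTypeID, canonicalize, opcodeSlot, typeSlot, operandSlot, NumOpcodes, and so on) are hypothetical and are not the actual ir2vec::Vocabulary API; the real layout in the patch may differ.

```cpp
#include <cassert>

// Hypothetical, simplified stand-ins for illustration only.
// Raw type IDs as the IR might expose them (assumed subset).
enum class RawTypeID : unsigned { Half, BFloat, Float, Double, Void, Integer, Pointer };

// Canonical type IDs: several raw types collapse into one bucket.
enum class CanonicalTypeID : unsigned { FloatTy, VoidTy, IntegerTy, PointerTy, Count };

constexpr unsigned NumOpcodes = 4;      // assumed tiny opcode set
constexpr unsigned NumCanonicalTypes =
    static_cast<unsigned>(CanonicalTypeID::Count);
constexpr unsigned NumOperandKinds = 3; // assumed operand-kind count
constexpr unsigned NumSlots = NumOpcodes + NumCanonicalTypes + NumOperandKinds;

// Map every raw type to its canonical bucket (all float variants -> FloatTy).
inline CanonicalTypeID canonicalize(RawTypeID T) {
  switch (T) {
  case RawTypeID::Half:
  case RawTypeID::BFloat:
  case RawTypeID::Float:
  case RawTypeID::Double:
    return CanonicalTypeID::FloatTy;
  case RawTypeID::Void:
    return CanonicalTypeID::VoidTy;
  case RawTypeID::Integer:
    return CanonicalTypeID::IntegerTy;
  case RawTypeID::Pointer:
    return CanonicalTypeID::PointerTy;
  }
  return CanonicalTypeID::VoidTy; // unreachable for the enumerators above
}

// Assumed slot layout: [ opcodes | canonical types | operand kinds ].
// Each entity maps to one index in a single contiguous range [0, NumSlots).
inline unsigned opcodeSlot(unsigned Opcode) {
  assert(Opcode < NumOpcodes && "opcode out of range");
  return Opcode;
}

inline unsigned typeSlot(RawTypeID T) {
  return NumOpcodes + static_cast<unsigned>(canonicalize(T));
}

inline unsigned operandSlot(unsigned Kind) {
  assert(Kind < NumOperandKinds && "operand kind out of range");
  return NumOpcodes + NumCanonicalTypes + Kind;
}
```

With a layout like this, indices have no gaps, so a consumer that dumps entity IDs can rely on every value in [0, NumSlots) naming exactly one entity, whichever component ends up owning the mapping.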
https://github.com/llvm/llvm-project/pull/155323