[llvm] [IR2Vec] Refactor vocabulary to use canonical type IDs (PR #155323)

Fri Aug 29 13:10:08 PDT 2025

================
@@ -137,13 +138,48 @@ using InstEmbeddingsMap = DenseMap<const Instruction *, Embedding>;
 using BBEmbeddingsMap = DenseMap<const BasicBlock *, Embedding>;
 
 /// Class for storing and accessing the IR2Vec vocabulary.
-/// Encapsulates all vocabulary-related constants, logic, and access methods.
+///
+/// The Vocabulary class manages seed embeddings for LLVM IR entities. It
+/// contains the seed embeddings for three types of entities: instruction
+/// opcodes, types, and operands. Types are grouped/canonicalized for better
+/// learning (e.g., all float variants map to FloatTy). The vocabulary abstracts
+/// away the canonicalization effectively, the exposed APIs handle all the known
+/// LLVM IR opcodes, types and operands.
+///
+/// This class helps populate the seed embeddings in an internal vector-based
+/// ADT. It provides logic to map every IR entity to a specific slot index or
+/// position in this vector, enabling O(1) embedding lookup while avoiding
+/// unnecessary computations involving string based lookups while generating the
+/// embeddings.
 class Vocabulary {
   friend class llvm::IR2VecVocabAnalysis;
   using VocabVector = std::vector<ir2vec::Embedding>;
   VocabVector Vocab;
   bool Valid = false;
 
+public:
+  // Slot layout:
----------------
svkeerthy wrote:

Yes we can do this refactoring. But, the tool needs consecutive indexing of entities while dumping the triplets. So, this logic should either be in the ir2vec::Vocabulary or should be moved to the tool.

https://github.com/llvm/llvm-project/pull/155323