[llvm-dev] llvm-canon

Fri Aug 9 13:37:11 PDT 2019

Hi all,

Many of us find ourselves spending a great chunk of time comparing LLVM IR dumps at various stages of compilation pipeline or after a given optimization pass. Said process can be extremely laborious, and this is especially true when comparing shaders or compute modules. Important semantic differences are often difficult to spot because of the irregular naming and ordering of instructions.

Looking for a solution we have come across Puyan Lotfi's talk at 2018 EuroLLVM on mir-canon (https://www.youtube.com/watch?v=RHT-bh_xo6U). His project inspired us to invest some time into developing a tool called llvm-canon aiming to achieve a very similar goal - a clean and obvious diff but for LLVM IR dumps.

Currently the tool is used internally. We have decided to post here to receive some feedback from the community, spark a discussion on various canonicalization techniques, and ask for a general code review to eventually push the tool to the public trunk.

The aim of this project is to develop a tool which will transform the LLVM IR code into a canonical form by reducing non-semantic differences. Therefore, in ideal conditions such canonical form will lead to a clean diff based upon the fact that:
1) Two semantically identical modules should show no differences when diffed after canonicalization.
2) When modules are not identical the semantic differences should stand out.

The tool processes just one file/module at a time - there is no intention for this to be a diff tool. The canonicalization techniques are universal for all modules. Additionally, we want the tool to be as modular as possible allowing for easy changes in the future.

All of the canonicalization techniques are focused only on reordering and naming. Having experimented with many options for removing redundant code we came to conclusion that this topic is definitely out of scope of this project and should be left out for other passes. 

Scope of this project:
* Reordering instructions
* Naming values (naming initial instructions, naming regular instructions)
* Folding instruction names
* Naming function arguments
* Naming basic blocks

One of the biggest challenges in the development of this project was finding a point of reference for canonicalization - naming and ordering must be based on something. It would be unacceptable for a trivial change (such as different opcode, single operand, name of the called function) in a single instruction to affect the canonicalization of other instructions in the same basic block or even function. Therefore, the point of reference should be fairly "constant". Currently, the canonicalization is highly based on the order of side-effecting instructions (also referred to as outputs) and how they relate to the very first instructions in each basic block (non-redundant instructions with only immediate operands; we call them initial instructions). We have found this reference to be relatively effective in our use cases, yet this is the biggest weakness of this canonicalizer. Modules with different layout of side-effecting instructions may require some tinkering by reordering those instructions by hand. We are still looking for some ways to automate this process. 

Below I will briefly describe each component of the canonicalizer.

1. Reordering instructions
Instruction reordering starts by iterating through the vector of output instructions (side-effecting instructions + ReturnInst) collected top-down in a given function. On each of these instructions a recursive reordering method is called. This method walks up the def-use tree (starting from the first operand of an output instruction and going up) and reduces the distance between the definition and the user. Each instruction may be moved just once. In case the user is in a different basic block, the definition is placed at the end of the current basic block.

The method works surprisingly well yet it does not reorder side-effecting instructions. It is one of the subjects for future discussion and improvement of this tool. Another thing to note is that this algorithm only works on a vector of outputs collected top-down. Moving every instruction just once while making sure that it is as close as possible to the first use guarantees that the def-use chain will be valid upon completion.

2. Naming instructions
Naming also starts with a vector of output instructions collected top-down. On each of these instructions a recursive naming method is called. This method identifies the type of an instruction and names it accordingly.

Currently, we distinguish two types of instructions when it comes to naming:
a) Initial instruction (non-redundant instructions with only immediate operands)
b) Regular instruction (all other non-redundant instructions)

In case of initial instructions, the naming follows this general model:
vl <hash> <callee name> ( <operands> )

A sample name is given below:
vl12345Foo (2, 4)

The hash is calculated considering instruction's opcode and output footprint. Output footprint is a vector of unique indices (distance from the beginning of the basic block) of output instructions generated for each initial instruction. This being said, an initial instruction with an output footprint 5, 7, 9, 11 is used by those outputs at least once.

The callee name is only included when the named instruction is a CallInst.

The very last part of the initial instruction name consists of a list of operands in parentheses. The list is sorted when the instruction is commutative.

In case of regular instructions, the naming follows this general model:
op <hash> <callee name> ( <operands call chain> )

With a sample name given below:
op11111Foo (op99999Bar(op55555(...), op22222(...)), op77777(...), vl44444(7))

The hash is calculated considering instruction's opcode and opcodes of all operands.

The callee name is only included when the named instruction is a CallInst.

The last part of the regular instruction name model is a recursively generated list of all operands, and their operands, and so on until we reach initial instructions (depth-first). This may remind more of a call chain than operand list. This naming approach makes differences stand out on all instructions that are affected. This is a desired effect on output instructions as we can instantly see that the semantics have changed. In cases when an output instruction is void, this type of naming is useful on pre-output instructions (direct operands of an output instruction). Besides that this technique pollutes the diff as a single change resonates down the block. This is why names are "folded" in the next phase of canonicalization.

3. Folding instruction names
As I have mentioned before, recursively generated long operand lists are quite useful when comparing output instructions or their proximate operands in case of void output. In all other cases a simple operand list should result in a much more readable diff. This is why after naming all the instructions, we iterate through them one more time to "fold"/shorten their names. In this phase the operands call chain is substituted with an operand list, and the callee name is removed.

Folding initial instructions:
vl00000Calle(1, 2)  ->  vl00000(1, 2)

Folding regular instructions:
op00000Calle(op00001Calle(...), vl00000Calle(1, 2), ...)  ->  op00000(op00001, vl00000, ...)

Therefore naming and folding give us this kind of effect:
Folded (no changes): 			vl00001(1, 2) = ...
Folded (removed callee name): 	vl00002(6, 5) = call ...
Folded (kept only hash): 		op10001(vl00001, vl00002) = ...
Unfolded (output): 			op10002(op10001(vl00001(1, 2), vl00002Foo(6, 5)), vl00001(1, 2)) = ...

4. Other
We have been looking for various ways to canonicalize the names of function arguments. But because of their marginal impact on the overall diff we have decided to stick to just numbering them.

We are also naming basic blocks following a scheme: bb <hash> where the hash value is calculated considering the opcodes and order of output instructions in a given basic block.

5. Areas of further research
- As mentioned before we are constantly looking for some different points of reference for canonicalization.
- Approaches allowing for automatic reordering of side-effecting instructions.
- How sorting operand list on commutative instructions could affect instruction reordering.

The canonicalizer is highly modular allowing for easy modifications and future expansion.

We hope the community will find this tool useful like we have.

I would like to thank Radoslaw Drabinski who really helped me a lot by suggesting numerous improvements and testing the tool.

I have uploaded the tool to the Phabricator and would be very grateful for your review!
https://reviews.llvm.org/D66029

Thank you,
Michal Paszkowski

--------------------------------------------------------------------

Intel Technology Poland sp. z o.o.
ul. Slowackiego 173 | 80-298 Gdansk | Sad Rejonowy Gdansk Polnoc | VII Wydzial Gospodarczy Krajowego Rejestru Sadowego - KRS 101882 | NIP 957-07-52-316 | Kapital zakladowy 200.000 PLN.

Ta wiadomosc wraz z zalacznikami jest przeznaczona dla okreslonego adresata i moze zawierac informacje poufne. W razie przypadkowego otrzymania tej wiadomosci, prosimy o powiadomienie nadawcy oraz trwale jej usuniecie; jakiekolwiek
przegladanie lub rozpowszechnianie jest zabronione.
This e-mail and any attachments may contain confidential material for the sole use of the intended recipient(s). If you are not the intended recipient, please contact the sender and delete all copies; any review or distribution by
others is strictly prohibited.