[llvm-dev] llvm-canon

Fri Aug 9 18:18:12 PDT 2019

Nice work guys! Thanks for the shoutout.

High level comments:

1) I think it could be useful to move your Canonicalizer ModulePass from
llvm/tool/llvm-canon to somewhere in llvm/lib/Transforms so that it could
possibly be used by llc or clang. I think it could be really useful to do
something like clang -o - -S -emit-llvm -femit-canonical-ir foo.c and get
the canonicalized output directly. Keep in mind, anything that is
accessible through clang or llc can also be really easily accessed through
the webapp godbolt.org :-). I still think llvm-canon.cpp and llvm-canon
executable tool is fine, but I think it would be cool if the ModulePass
part were reusable in other parts of llvm.
2) clang-format
3) Add lit tests.

I am all for this being added to llvm. I'll post additional comments in
Phabricator.

Puyan

On Fri, Aug 9, 2019 at 1:37 PM Paszkowski, Michal via llvm-dev <
llvm-dev at lists.llvm.org> wrote:

> Hi all,
>
> Many of us find ourselves spending a great chunk of time comparing LLVM IR
> dumps at various stages of compilation pipeline or after a given
> optimization pass. Said process can be extremely laborious, and this is
> especially true when comparing shaders or compute modules. Important
> semantic differences are often difficult to spot because of the irregular
> naming and ordering of instructions.
>
> Looking for a solution we have come across Puyan Lotfi's talk at 2018
> EuroLLVM on mir-canon (https://www.youtube.com/watch?v=RHT-bh_xo6U). His
> project inspired us to invest some time into developing a tool called
> llvm-canon aiming to achieve a very similar goal - a clean and obvious diff
> but for LLVM IR dumps.
>
> Currently the tool is used internally. We have decided to post here to
> receive some feedback from the community, spark a discussion on various
> canonicalization techniques, and ask for a general code review to
> eventually push the tool to the public trunk.
>
> The aim of this project is to develop a tool which will transform the LLVM
> IR code into a canonical form by reducing non-semantic differences.
> Therefore, in ideal conditions such canonical form will lead to a clean
> diff based upon the fact that:
> 1) Two semantically identical modules should show no differences when
> diffed after canonicalization.
> 2) When modules are not identical the semantic differences should stand
> out.
>
> The tool processes just one file/module at a time - there is no intention
> for this to be a diff tool. The canonicalization techniques are universal
> for all modules. Additionally, we want the tool to be as modular as
> possible allowing for easy changes in the future.
>
> All of the canonicalization techniques are focused only on reordering and
> naming. Having experimented with many options for removing redundant code
> we came to conclusion that this topic is definitely out of scope of this
> project and should be left out for other passes.
>
> Scope of this project:
> * Reordering instructions
> * Naming values (naming initial instructions, naming regular instructions)
> * Folding instruction names
> * Naming function arguments
> * Naming basic blocks
>
> One of the biggest challenges in the development of this project was
> finding a point of reference for canonicalization - naming and ordering
> must be based on something. It would be unacceptable for a trivial change
> (such as different opcode, single operand, name of the called function) in
> a single instruction to affect the canonicalization of other instructions
> in the same basic block or even function. Therefore, the point of reference
> should be fairly "constant". Currently, the canonicalization is highly
> based on the order of side-effecting instructions (also referred to as
> outputs) and how they relate to the very first instructions in each basic
> block (non-redundant instructions with only immediate operands; we call
> them initial instructions). We have found this reference to be relatively
> effective in our use cases, yet this is the biggest weakness of this
> canonicalizer. Modules with different layout of side-effecting instructions
> may require some tinkering by reordering those instructions by hand. We are
> still looking for some ways to automate this process.
>
> Below I will briefly describe each component of the canonicalizer.
>
> 1. Reordering instructions
> Instruction reordering starts by iterating through the vector of output
> instructions (side-effecting instructions + ReturnInst) collected top-down
> in a given function. On each of these instructions a recursive reordering
> method is called. This method walks up the def-use tree (starting from the
> first operand of an output instruction and going up) and reduces the
> distance between the definition and the user. Each instruction may be moved
> just once. In case the user is in a different basic block, the definition
> is placed at the end of the current basic block.
>
> The method works surprisingly well yet it does not reorder side-effecting
> instructions. It is one of the subjects for future discussion and
> improvement of this tool. Another thing to note is that this algorithm only
> works on a vector of outputs collected top-down. Moving every instruction
> just once while making sure that it is as close as possible to the first
> use guarantees that the def-use chain will be valid upon completion.
>
> 2. Naming instructions
> Naming also starts with a vector of output instructions collected
> top-down. On each of these instructions a recursive naming method is
> called. This method identifies the type of an instruction and names it
> accordingly.
>
> Currently, we distinguish two types of instructions when it comes to
> naming:
> a) Initial instruction (non-redundant instructions with only immediate
> operands)
> b) Regular instruction (all other non-redundant instructions)
>
> In case of initial instructions, the naming follows this general model:
> vl <hash> <callee name> ( <operands> )
>
> A sample name is given below:
> vl12345Foo (2, 4)
>
> The hash is calculated considering instruction's opcode and output
> footprint. Output footprint is a vector of unique indices (distance from
> the beginning of the basic block) of output instructions generated for each
> initial instruction. This being said, an initial instruction with an output
> footprint 5, 7, 9, 11 is used by those outputs at least once.
>
> The callee name is only included when the named instruction is a CallInst.
>
> The very last part of the initial instruction name consists of a list of
> operands in parentheses. The list is sorted when the instruction is
> commutative.
>
>
> In case of regular instructions, the naming follows this general model:
> op <hash> <callee name> ( <operands call chain> )
>
> With a sample name given below:
> op11111Foo (op99999Bar(op55555(...), op22222(...)), op77777(...),
> vl44444(7))
>
> The hash is calculated considering instruction's opcode and opcodes of all
> operands.
>
> The callee name is only included when the named instruction is a CallInst.
>
> The last part of the regular instruction name model is a recursively
> generated list of all operands, and their operands, and so on until we
> reach initial instructions (depth-first). This may remind more of a call
> chain than operand list. This naming approach makes differences stand out
> on all instructions that are affected. This is a desired effect on output
> instructions as we can instantly see that the semantics have changed. In
> cases when an output instruction is void, this type of naming is useful on
> pre-output instructions (direct operands of an output instruction). Besides
> that this technique pollutes the diff as a single change resonates down the
> block. This is why names are "folded" in the next phase of canonicalization.
>
> 3. Folding instruction names
> As I have mentioned before, recursively generated long operand lists are
> quite useful when comparing output instructions or their proximate operands
> in case of void output. In all other cases a simple operand list should
> result in a much more readable diff. This is why after naming all the
> instructions, we iterate through them one more time to "fold"/shorten their
> names. In this phase the operands call chain is substituted with an operand
> list, and the callee name is removed.
>
> Folding initial instructions:
> vl00000Calle(1, 2)  ->  vl00000(1, 2)
>
> Folding regular instructions:
> op00000Calle(op00001Calle(...), vl00000Calle(1, 2), ...)  ->
> op00000(op00001, vl00000, ...)
>
>
> Therefore naming and folding give us this kind of effect:
> Folded (no changes):                    vl00001(1, 2) = ...
> Folded (removed callee name):   vl00002(6, 5) = call ...
> Folded (kept only hash):                op10001(vl00001, vl00002) = ...
> Unfolded (output):                      op10002(op10001(vl00001(1, 2),
> vl00002Foo(6, 5)), vl00001(1, 2)) = ...
>
> 4. Other
> We have been looking for various ways to canonicalize the names of
> function arguments. But because of their marginal impact on the overall
> diff we have decided to stick to just numbering them.
>
> We are also naming basic blocks following a scheme: bb <hash> where the
> hash value is calculated considering the opcodes and order of output
> instructions in a given basic block.
>
> 5. Areas of further research
> - As mentioned before we are constantly looking for some different points
> of reference for canonicalization.
> - Approaches allowing for automatic reordering of side-effecting
> instructions.
> - How sorting operand list on commutative instructions could affect
> instruction reordering.
>
> The canonicalizer is highly modular allowing for easy modifications and
> future expansion.
>
> We hope the community will find this tool useful like we have.
>
> I would like to thank Radoslaw Drabinski who really helped me a lot by
> suggesting numerous improvements and testing the tool.
>
> I have uploaded the tool to the Phabricator and would be very grateful for
> your review!
> https://reviews.llvm.org/D66029
>
> Thank you,
> Michal Paszkowski
>
>
> --------------------------------------------------------------------
>
> Intel Technology Poland sp. z o.o.
> ul. Slowackiego 173 | 80-298 Gdansk | Sad Rejonowy Gdansk Polnoc | VII
> Wydzial Gospodarczy Krajowego Rejestru Sadowego - KRS 101882 | NIP
> 957-07-52-316 | Kapital zakladowy 200.000 PLN.
>
> Ta wiadomosc wraz z zalacznikami jest przeznaczona dla okreslonego
> adresata i moze zawierac informacje poufne. W razie przypadkowego
> otrzymania tej wiadomosci, prosimy o powiadomienie nadawcy oraz trwale jej
> usuniecie; jakiekolwiek
> przegladanie lub rozpowszechnianie jest zabronione.
> This e-mail and any attachments may contain confidential material for the
> sole use of the intended recipient(s). If you are not the intended
> recipient, please contact the sender and delete all copies; any review or
> distribution by
> others is strictly prohibited.
>
> _______________________________________________
> LLVM Developers mailing list
> llvm-dev at lists.llvm.org
> https://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.llvm.org/pipermail/llvm-dev/attachments/20190809/c459cd91/attachment.html>