[PATCH] D22051: MergeSimilarFunctions: a code size pass to merge functions with small differences

Mon Jul 11 16:01:45 PDT 2016

On 07/07/2016 07:10 PM, Daniel Berlin wrote:
> Two thoughts after staring at it:
>
> 1. If you were to form SESE/etc regions and process them in topo 
> order, you could actually do even better than you do now.   You would 
> be able to validly say "the first X regions of this function are the 
> same". You currently do this in some cases, but you could do it in more :)

Yep, that's a good idea. You could then also outline identical regions 
from functions... :)

An easy improvement to the current traversal algorithm would be to 
enqueue successor blocks in the order of their instruction count as 
opposed to the order they appear in the terminator instruction. That 
way, you could handle easy cases where the successor blocks are swapped 
but otherwise equivalent or similar.

> 2. You could integrate most of what you are doing with mergefuncs with 
> a bit of work.

Depends on how you define "a bit of work" ;) See below.

>     As to whether this could be integrated into the in-tree
>     MergeFunctions... Well, this started off as a patch to
>     MergeFunctions back in 2013. However, the in-tree MergeFunctions
>     has undergone significant architectural changes since then; it now
>     uses a total ordering of functions to speed up merging. This is
>     great if you only want to merge identical functions, but it
>     doesn't work for merging of similar functions.
>
>
> This is not correct.
> You could use simhash, for example, or any thing that will give you 
> actual similarity metrics in addition to total ordering.

What I meant to say is that it doesn't make sense to use the *current* 
hash function and the *current* total ordering criterion in the in-tree 
MergeFunctions if you want to merge similar functions.

Why?

1. The hash function includes the instruction opcodes. Similar, but 
non-identical, functions will end up in different buckets. We'd have to 
replace the hash function by something less precise for similar merging. 
Hashing the CFG structure is a neat idea though, I should be doing that too.

2. The total ordering criterion (used inside the buckets) in 
MergeFunctions is based on equality. A very large chunk of the 
MergeFunctions code deals with those comparisons (see all the cmpXYZ 
functions). Similar functions won't be near each other if you use that 
specific ordering criterion, so it's not much use. We'd need to replace 
the criterion by something else - and yes, similarity hashing is a great 
option for that (although not yet implemented so unclear how well it'd 
work for IR).

So essentially, to make this work in MergeFunctions, we'd have to rip 
out two of its key advantages (which are great if you only want to merge 
identical functions)... and you'd essentially get this pass :)

Am I missing something?

Since the two MergeFunctions share a common ancestor, there is some 
shared code; basically a shared skeleton. So it should be possible to 
factor that out into a "MergeFunctionsUtils" sort of module. Perhaps 
that's a way to go if people are happy with that?

Thanks,
Tobias

-- 
Qualcomm Innovation Center, Inc. is a member of Code Aurora Forum,
a Linux Foundation Collaborative Project.

-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.llvm.org/pipermail/llvm-commits/attachments/20160711/9424d22a/attachment.html>