[llvm] [SampleFDO] Improve stale profile matching by diff algorithm (PR #87375)

Mon Apr 29 13:20:52 PDT 2024

https://github.com/wlei-llvm commented:

Thanks so much for the review! 
Seems many suggestions are related to specialize the diff algorithm. Let me try to clarify them together here: 
- The algorithm is a basic version, the result is optimal, but there are still many space for time/space optimizations, decoupling it from the stale matching could make people easily extend it in future. 
- The paper is very classical(assuming it's rigorously designed and written), so I wanted to keep code/variables as much as the same as the paper to convey the intension this is really a "copy-paste" of the algorithm(rather than my maybe inaccurate interpretation).
- In very soon, I want to use this diff algorithm for other purpose: function renaming matching, which use the edit scripts to compute a similarity of profile and IR, like `similarity = equalSet / (equalSet + insertSet + deleteSet)`, so those `insertSet` and `deleteSet` will be used there. 

>can we replace A,B,M,N,X,Y,V names with something meaningful and more readable? Also Max is not really Max but total size of two sequence.

The original paper converts the problem(find the LCS) to find the shortest path in the edit graph, I guess those X,Y,M,N, are just common used notation in graph algorithm area. `V` is `Vector?`new defined in the paper, but I guess from "An array, V, contains the..."), the `MAX` is the upper bound(the longest path). So basically all the variables are the paper's original used/defined variables( reason #2).

It seems the confusing comes up after I renamed the function from "longestCommonSequence" to "shortestEdit",  which the paper actually does the  `shortestEdit`, I think I need to change it back to `shortestEdit`. 

>Does this simple +Max need a lambda function?

This is also for reason #2, The index in pseudo code of the paper can be negative(guess an old c-style), this is just to simply use `V[Index(K - 1)]` to looks closer to `V[K - 1]` than `V[Max + K - 1]`

>It would simpler if LCS just return a "CommonSequence" as represented by LocToLocMap.
> The abstraction of DiffResult and MyersDiff seem a bit unnecessary. A standalone function LocToLocMap longestCommonSequence(..) would be cleaner I think.

Similarly here, for reason #3, it will need the other fields(`insert`/`delete`). 

>Slight preference to test it with lit rather than unit test. I think we should be able to craft a test case that shows different matching from the new algorithm?
>That way we can also remove the test/debug only APIs.

Similarly, I wanted to make it easy to extend to a "shared" library, so tried to test it independently. 

https://github.com/llvm/llvm-project/pull/87375