[llvm] update_test_checks: keep meta variables stable by default (PR #76748)

Wed Mar 6 03:25:54 PST 2024

Nicolai =?utf-8?q?Hähnle?= <nicolai.haehnle at amd.com>,
Nicolai =?utf-8?q?Hähnle?= <nicolai.haehnle at amd.com>,
Nicolai =?utf-8?q?Hähnle?= <nicolai.haehnle at amd.com>,
Nicolai =?utf-8?q?Hähnle?= <nicolai.haehnle at amd.com>,
Nicolai =?utf-8?q?Hähnle?= <nicolai.haehnle at amd.com>
Message-ID:
In-Reply-To: <llvm.org/llvm/llvm-project/pull/76748 at github.com>


================
@@ -1187,20 +1233,317 @@ def may_clash_with_default_check_prefix_name(check_prefix, var):
     )
 
 
+def find_diff_matching(lhs: List[str], rhs: List[str]) -> List[int]:
+    """
+    Find a large ordered matching between strings in lhs and rhs.
+
+    Think of this as finding the *unchanged* lines in a diff, where the entries
+    of lhs and rhs are lines of the files being diffed.
+
+    Returns a list of matched (lhs_idx, rhs_idx) pairs.
+    """
+
+    # Collect matches in reverse order.
+    matches = []
+
+    def recurse(lhs_start, lhs_end, rhs_start, rhs_end):
+        if lhs_start == lhs_end or rhs_start == rhs_end:
+            return
+
+        # First, collect a set of candidate matching edges. We limit this to a
+        # constant multiple of the input size to avoid quadratic runtime.
+        patterns = collections.defaultdict(lambda: ([], []))
+
+        for idx in range(lhs_start, lhs_end):
+            patterns[lhs[idx]][0].append(idx)
+        for idx in range(rhs_start, rhs_end):
+            patterns[rhs[idx]][1].append(idx)
+
+        multiple_patterns = []
+
+        candidates = []
+        for pattern in patterns.values():
+            if not pattern[0] or not pattern[1]:
+                continue
+
+            if len(pattern[0]) == len(pattern[1]) == 1:
+                candidates.append((pattern[0][0], pattern[1][0]))
+            else:
+                multiple_patterns.append(pattern)
+
+        multiple_patterns.sort(key=lambda pattern: len(pattern[0]) * len(pattern[1]))
+
+        for pattern in multiple_patterns:
+            if len(candidates) + len(pattern[0]) * len(pattern[1]) > 2 * (len(lhs) + len(rhs)):
----------------
jasilvanus wrote:

I find this a bit overly restrictive -- this means we'll give up on

```
find_diff_matching(["foo", "foo", "foo"], ["foo", "foo", "foo"])
```

As simple improvement, we could just allow an additive constant (say 100) to not have to worry about small cases.

I think we can easily have large blocks of identical lines (ignoring variable names as in the first matching round), e.g. if there are many similar loads or stores resulting from a bulk copy, and it's a bit unsatisfying to completely give up on those.

Maybe instead we could limit the number of edges incident to any given line (say on the LHS) by a constant, which is enough to guarantee there are not too many edges, and if the number of matching lines in the RHS is larger, then take a subset? For example, we could sample randomly, or select lines in a similar relative position.

https://github.com/llvm/llvm-project/pull/76748