[PATCH] D123776: [Support] Optimize (.*) regex matches

Nikita Popov via Phabricator via llvm-commits llvm-commits at lists.llvm.org
Thu Apr 14 03:14:06 PDT 2022


nikic created this revision.
Herald added subscribers: luke957, dexonsmith, luismarques, s.egerton, PkmX, simoncook, hiraditya, arichardson.
Herald added a project: All.
nikic requested review of this revision.
Herald added subscribers: llvm-commits, pcwang-thead.
Herald added a project: LLVM.

If capturing groups are used, the regex matcher handles something like `(.*)suffix` by first doing a maximal match of `.*`, trying to match `suffix` afterward, and then reducing the maximal stop position one by one until this finally succeeds. This makes the match quadratic in the length of the line (with large constant factors).
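
As a rough illustration of the backoff described above, here is a minimal sketch that models only the retry loop. It is not the regengine.inc code: `naive_split` is a hypothetical helper, and a literal `suffix` string stands in for the rest of the pattern. It counts how many candidate stop positions are tried. In the real matcher each retry also rescans part of the line, which is where the quadratic cost comes from.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical model of matching (.*)suffix inside [start, stop):
 * begin with the greedy split (".*" takes everything) and move the
 * split point back one character at a time until "suffix" matches
 * there. Returns the split point, or NULL if no position works.
 * *attempts receives the number of positions that were tried. */
static const char *
naive_split(const char *start, const char *stop, const char *suffix,
            size_t *attempts)
{
	size_t n = strlen(suffix);
	const char *p = stop;

	assert(stop >= start);
	*attempts = 0;
	for (;;) {
		++*attempts;
		/* Does the suffix fit and match at this position? */
		if ((size_t)(stop - p) >= n && memcmp(p, suffix, n) == 0)
			return p;
		if (p == start)
			return NULL;
		--p;	/* no -- try a match that is one character shorter */
	}
}
```

On the 21-character line `%abc = add i32 %x, %y` with the suffix ` = `, this model tries 18 positions before it reaches the split at offset 4.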

This is particularly problematic because regexes of this form are ubiquitous in FileCheck (something like `[[VAR:%.*]] = ...` falls in this category), making FileCheck executions much slower than they have any right to be.

This implements a very crude optimization that checks if `suffix` starts with a fixed character, and steps back to the last occurrence of that character, instead of stepping back by one character at a time. This drops FileCheck time on `clang/test/CodeGen/RISCV/rvv-intrinsics/vloxseg_mask.c` from 7.3 seconds to 2.7 seconds.
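
To make the effect of the character-based step concrete, here is a standalone sketch under simplifying assumptions. `step_back_to_char` mirrors the scan loop of the patch's `step_back`, but takes the character directly instead of reading it from the compiled strip. `skipping_split` is a hypothetical driver that uses a literal suffix in place of the rest of the pattern.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Move back from "stop" to the nearest earlier position that holds
 * "ch". If there is none, the result is "start". Always steps back
 * by at least one character. */
static const char *
step_back_to_char(const char *start, const char *stop, char ch)
{
	assert(stop > start);
	const char *res = stop - 1;
	for (; res != start; --res) {
		if (*res == ch)
			break;
	}
	return res;
}

/* Hypothetical driver: find where a non-empty literal "suffix" begins
 * in [start, stop), starting from the greedy split and retrying only
 * at positions that hold the first character of the suffix.
 * *attempts receives the number of positions that were tried. */
static const char *
skipping_split(const char *start, const char *stop, const char *suffix,
               size_t *attempts)
{
	size_t n = strlen(suffix);
	const char *p = stop;

	assert(n > 0);
	*attempts = 0;
	for (;;) {
		++*attempts;
		if ((size_t)(stop - p) >= n && memcmp(p, suffix, n) == 0)
			return p;
		if (p == start)
			return NULL;
		p = step_back_to_char(start, p, suffix[0]);
	}
}
```

On the line `%abc = add i32 %x, %y` with the suffix ` = `, this sketch retries only at the five spaces, so it needs 6 attempts (including the initial greedy one) to find the split at offset 4.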

An obvious further improvement would be to check more than one character (once again, this is particularly relevant for FileCheck, because the next character is usually a space, which happens to have many occurrences).
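
One possible shape for that improvement, sketched here purely as an assumption and not as part of this patch: compare a short fixed prefix instead of a single character. `step_back_to_prefix` and its `end` parameter (the end of the subject text, needed to bound the comparison) are hypothetical names.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical variant of the step-back scan: move back from "stop"
 * to the nearest earlier position where the n-byte "prefix" occurs,
 * never reading past "end". If there is no such position, the result
 * is "start", which is the same conservative fallback as the
 * single-character scan. */
static const char *
step_back_to_prefix(const char *start, const char *stop, const char *end,
                    const char *prefix, size_t n)
{
	assert(stop > start && end >= stop);
	const char *res = stop - 1;
	for (; res != start; --res) {
		if ((size_t)(end - res) >= n && memcmp(res, prefix, n) == 0)
			break;
	}
	return res;
}
```

With the two-character prefix ` =` on the line `%abc = add i32 %x, %y`, a single call starting from the end of the line lands directly on offset 4, skipping the four later spaces that a one-character check would stop at.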

This should help with https://github.com/llvm/llvm-project/issues/54821.


https://reviews.llvm.org/D123776

Files:
  llvm/lib/Support/regengine.inc


Index: llvm/lib/Support/regengine.inc
===================================================================
--- llvm/lib/Support/regengine.inc
+++ llvm/lib/Support/regengine.inc
@@ -53,6 +53,7 @@
 #define	at	sat
 #define	match	smat
 #define	nope	snope
+#define step_back	sstep_back
 #endif
 #ifdef LNAMES
 #define	matcher	lmatcher
@@ -65,6 +66,7 @@
 #define	at	lat
 #define	match	lmat
 #define	nope	lnope
+#define step_back	lstep_back
 #endif
 
 /* another structure passed up and down to avoid zillions of parameters */
@@ -288,6 +290,38 @@
 	return(0);
 }
 
+/* Step back from "stop" to a position where the strip startst..stopst might
+ * match. This can always conservatively return "stop - 1", but may return an
+ * earlier position if matches at later positions are impossible. */
+static const char *
+step_back(struct re_guts *g, const char *start, const char *stop, sopno startst,
+          sopno stopst)
+{
+	/* Always step back at least one character. */
+	assert(stop > start);
+	const char *res = stop - 1;
+
+	/* Check whether the strip startst..stopst starts with a fixed character,
+	 * ignoring any closing parentheses. If not, return a conservative result. */
+	for (;;) {
+		if (startst >= stopst)
+			return res;
+		if (OP(g->strip[startst]) != ORPAREN)
+			break;
+		startst++;
+	}
+	if (OP(g->strip[startst]) != OCHAR)
+		return res;
+
+	/* Find the character that starts the following match. */
+	char ch = OPND(g->strip[startst]);
+	for (; res != start; --res) {
+		if (*res == ch)
+			break;
+	}
+	return res;
+}
+
 /*
  - dissect - figure out what matched what, no back references
  */
@@ -358,7 +392,7 @@
 				if (tail == stop)
 					break;		/* yes! */
 				/* no -- try a shorter match for this one */
-				stp = rest - 1;
+				stp = step_back(m->g, sp, rest, es, stopst);
 				assert(stp >= sp);	/* it did work */
 			}
 			ssub = ss + 1;
@@ -383,7 +417,7 @@
 				if (tail == stop)
 					break;		/* yes! */
 				/* no -- try a shorter match for this one */
-				stp = rest - 1;
+				stp = step_back(m->g, sp, rest, es, stopst);
 				assert(stp >= sp);	/* it did work */
 			}
 			ssub = ss + 1;
@@ -1032,3 +1066,4 @@
 #undef	at
 #undef	match
 #undef	nope
+#undef	step_back

