[llvm] [TableGen] Fix prefix detection with anchor (NFC) (PR #71379)

Nikita Popov via llvm-commits llvm-commits at lists.llvm.org
Mon Nov 6 03:10:03 PST 2023


https://github.com/nikic created https://github.com/llvm/llvm-project/pull/71379

instregex uses an optimization in which the constant prefix of the regex is extracted so that a binary search can be performed first. However, this optimization currently fails to apply in most cases, because most instregex uses have an explicit ^ anchor, which is counted as a metacharacter and disables the optimization.
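
To illustrate why a constant prefix helps, here is a minimal standalone sketch of a prefix binary search over a sorted list of names. It is not the TableGen implementation: `prefixRange` and the use of plain `std::vector<std::string>` are assumptions made for the example, and it requires C++17.

```cpp
#include <algorithm>
#include <cstddef>
#include <string>
#include <string_view>
#include <utility>
#include <vector>

// Hypothetical helper (not part of TableGen): returns the half-open index
// range [first, last) of entries in a lexicographically sorted name list
// that start with Prefix. Only that range needs the full regex match.
std::pair<std::size_t, std::size_t>
prefixRange(const std::vector<std::string> &SortedNames,
            std::string_view Prefix) {
  // Compare only the first Prefix.size() characters of each name.
  // Truncation preserves the sort order, so binary search stays valid.
  auto Lo = std::lower_bound(
      SortedNames.begin(), SortedNames.end(), Prefix,
      [](const std::string &Name, std::string_view P) {
        return std::string_view(Name).substr(0, P.size()) < P;
      });
  auto Hi = std::upper_bound(
      Lo, SortedNames.end(), Prefix,
      [](std::string_view P, const std::string &Name) {
        return P < std::string_view(Name).substr(0, P.size());
      });
  return {static_cast<std::size_t>(Lo - SortedNames.begin()),
          static_cast<std::size_t>(Hi - SortedNames.begin())};
}
```

With an empty prefix the range covers the whole list, which is the situation the anchor was causing: every name had to be run through the regex.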

Make sure the anchor is skipped when determining the prefix. Also fix an implementation bug this exposes, where we pick too long a prefix if the first metacharacter is a quantifier.
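
The two fixes can be sketched outside of LLVM roughly as follows. This is a simplified illustration using `std::string_view` rather than `llvm::StringRef`; the `SplitPattern` struct and `splitPattern` function are hypothetical names, and the separate check for top-level `|` or `?` done in the real code is left out.

```cpp
#include <cstddef>
#include <string_view>

// Hypothetical result type for this sketch.
struct SplitPattern {
  bool HadAnchor;          // pattern started with an explicit ^
  std::string_view Prefix; // constant part usable for binary search
  std::string_view Rest;   // remainder that still needs regex matching
};

SplitPattern splitPattern(std::string_view Pattern) {
  // Drop an explicit ^ anchor so it is not treated as a metacharacter.
  bool HadAnchor = !Pattern.empty() && Pattern.front() == '^';
  if (HadAnchor)
    Pattern.remove_prefix(1);

  static constexpr std::string_view Metachars = "()^$|*+?.[]\\{}";
  std::size_t FirstMeta = Pattern.find_first_of(Metachars);
  if (FirstMeta == std::string_view::npos)
    return {HadAnchor, Pattern, {}};

  // A quantifier applies to the preceding character, so for ABC* only AB
  // is a constant prefix.
  if (FirstMeta > 0) {
    char C = Pattern[FirstMeta];
    if (C == '*' || C == '+' || C == '?')
      --FirstMeta;
  }
  return {HadAnchor, Pattern.substr(0, FirstMeta), Pattern.substr(FirstMeta)};
}
```

For example, `splitPattern("^ADD32r.")` yields the prefix `ADD32r` with the anchor recorded separately, whereas treating `^` as a metacharacter would have produced an empty prefix.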

This cuts the time needed to generate files like X86GenInstrInfo.inc by half.

From 64f5c675bb19eb8efa13e0e861aba03d0df7ebcd Mon Sep 17 00:00:00 2001
From: Nikita Popov <npopov at redhat.com>
Date: Mon, 6 Nov 2023 11:10:00 +0100
Subject: [PATCH] [TableGen] Fix prefix detection with anchor (NFC)

instregex uses an optimization in which the constant prefix of the
regex is extracted so that a binary search can be performed first.
However, this optimization currently fails to apply in most cases,
because most instregex uses have an explicit ^ anchor, which is
counted as a metacharacter and disables the optimization.

Make sure the anchor is skipped when determining the prefix. Also
fix an implementation bug this exposes, where we pick too long a
prefix if the first metacharacter is a quantifier.

This cuts the time needed to generate files like X86GenInstrInfo.inc
by half.
---
 llvm/utils/TableGen/CodeGenSchedule.cpp | 20 +++++++++++++++++++-
 1 file changed, 19 insertions(+), 1 deletion(-)

diff --git a/llvm/utils/TableGen/CodeGenSchedule.cpp b/llvm/utils/TableGen/CodeGenSchedule.cpp
index c3c5e4f8eb2d8c3..54463da19821476 100644
--- a/llvm/utils/TableGen/CodeGenSchedule.cpp
+++ b/llvm/utils/TableGen/CodeGenSchedule.cpp
@@ -91,10 +91,25 @@ struct InstRegexOp : public SetTheory::Operator {
         PrintFatalError(Loc, "instregex requires pattern string: " +
                                  Expr->getAsString());
       StringRef Original = SI->getValue();
+      // Drop an explicit ^ anchor to not interfere with prefix search.
+      bool HadAnchor = Original.consume_front("^");
 
       // Extract a prefix that we can binary search on.
       static const char RegexMetachars[] = "()^$|*+?.[]\\{}";
       auto FirstMeta = Original.find_first_of(RegexMetachars);
+      if (FirstMeta != StringRef::npos && FirstMeta > 0) {
+        // If we have a regex like ABC* we can only use AB as the prefix, as
+        // the * acts on C.
+        switch (Original[FirstMeta]) {
+        case '+':
+        case '*':
+        case '?':
+          --FirstMeta;
+          break;
+        default:
+          break;
+        }
+      }
 
       // Look for top-level | or ?. We cannot optimize them to binary search.
       if (removeParens(Original).find_first_of("|?") != std::string::npos)
@@ -106,7 +121,10 @@ struct InstRegexOp : public SetTheory::Operator {
       if (!PatStr.empty()) {
         // For the rest use a python-style prefix match.
         std::string pat = std::string(PatStr);
-        if (pat[0] != '^') {
+        // Add ^ anchor. If we had one originally, don't need the group.
+        if (HadAnchor) {
+          pat.insert(0, "^");
+        } else {
           pat.insert(0, "^(");
           pat.insert(pat.end(), ')');
         }


