<html>

    <head>

      <base href="https://bugs.llvm.org/">

    </head>

    <body><table border="1" cellspacing="0" cellpadding="8">

        <tr>

          <th>Bug ID</th>

          <td><a class="bz_bug_link 

          bz_status_NEW "

   title="NEW - regex_match doesn't fail early when given a non-matching pattern with a start-of-input anchor"

   href="https://bugs.llvm.org/show_bug.cgi?id=40390">40390</a>

          </td>

        </tr>

        <tr>

          <th>Summary</th>

          <td>regex_match doesn't fail early when given a non-matching pattern with a start-of-input anchor

          </td>

        </tr>

        <tr>

          <th>Product</th>

          <td>libc++

          </td>

        </tr>

        <tr>

          <th>Version</th>

          <td>unspecified

          </td>

        </tr>

        <tr>

          <th>Hardware</th>

          <td>PC

          </td>

        </tr>

        <tr>

          <th>OS</th>

          <td>All

          </td>

        </tr>

        <tr>

          <th>Status</th>

          <td>NEW

          </td>

        </tr>

        <tr>

          <th>Severity</th>

          <td>normal

          </td>

        </tr>

        <tr>

          <th>Priority</th>

          <td>P

          </td>

        </tr>

        <tr>

          <th>Component</th>

          <td>All Bugs

          </td>

        </tr>

        <tr>

          <th>Assignee</th>

          <td>unassignedclangbugs@nondot.org

          </td>

        </tr>

        <tr>

          <th>Reporter</th>

          <td>tom@kera.name

          </td>

        </tr>

        <tr>

          <th>CC</th>

          <td>llvm-bugs@lists.llvm.org, mclow.lists@gmail.com

          </td>

        </tr></table>

      <p>

        <div>

<pre>I first raised this on Stack Overflow (<a href="https://stackoverflow.com/q/54237547/560648">https://stackoverflow.com/q/54237547/560648</a>),
where I have posted some benchmarks to back up the claims below.

Take the following:

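#include <cstddef>   // std::size_t
#include <iterator>  // std::cbegin, std::cend
#include <regex>

int main()
{
  static const std::size_t BufSize = 100;
  char buf[BufSize] = {};  // BufSize '\0' characters
  auto begin = std::cbegin(buf), end = std::cend(buf);

  std::cmatch groups;
  std::regex::flag_type flags = std::regex_constants::ECMAScript;
  std::regex re("^what", flags);

  // Cannot match: the input does not start with "what".
  std::regex_search(begin, end, groups, re);
}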

This attempts to match the pattern "^what" against a block of 100 characters.
The match is not expected to succeed (in this case, the input is simply 100
'\0's, but the problem exists for any non-matching input).

However, I expect the match to fail as soon as the first character of input is
examined. As BufSize is adjusted to increasingly large values, the execution
time increases with it, suggesting that the regex engine is examining the
entire input even though the presence of the anchor "^" guarantees that a
match will never be found. It only needed to examine the first character to
know this. When BufSize reaches larger values like 100 KB, this becomes quite
problematic.
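
The scaling described above could be measured with something like the
following. This is only a rough sketch: the helper names time_search and
print_timings and the chosen sizes are illustrative assumptions, not the
benchmark posted on Stack Overflow, and absolute timings will vary by machine
and standard library.

```cpp
#include <chrono>
#include <cstddef>
#include <cstdio>
#include <regex>
#include <vector>

// Hypothetical helper: runs one std::regex_search of `re` over `size` '\0'
// characters and returns the elapsed time in microseconds. `found` receives
// the search result (expected to be false for "^what").
static long long time_search(const std::regex& re, std::size_t size, bool& found)
{
  std::vector<char> buf(size, '\0');
  const char* first = buf.data();
  const char* last = first + buf.size();
  std::cmatch groups;

  auto start = std::chrono::steady_clock::now();
  found = std::regex_search(first, last, groups, re);
  auto stop = std::chrono::steady_clock::now();

  return std::chrono::duration_cast<std::chrono::microseconds>(stop - start).count();
}

// Hypothetical driver: prints one timing per buffer size.
static void print_timings()
{
  const std::regex re("^what", std::regex_constants::ECMAScript);
  const std::size_t sizes[] = {100, 10000, 1000000};
  for (std::size_t size : sizes) {
    bool found = false;
    long long us = time_search(re, size, found);
    std::printf("size=%zu matched=%d time=%lldus\n", size, found ? 1 : 0, us);
  }
}
```

Calling print_timings() from main() prints one line per size. If the engine
failed fast, the reported times would be expected to stay roughly flat as the
size grows.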

It is clear from the implementation code
(<a href="https://github.com/llvm-mirror/libcxx/blob/master/include/regex#L5859-L5897">https://github.com/llvm-mirror/libcxx/blob/master/include/regex#L5859-L5897</a>)
that there is simply no logic in place to "fail fast" or "fail early" in a case
like this: the only way a "no match" result is returned is after examining the
whole input, regardless of the pattern.
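
To make the difference concrete, here is a toy model of the two behaviours.
It is emphatically not libc++'s code: the types and functions below are
hypothetical, and the "pattern" is reduced to a literal that is anchored at
the start of the input. It only counts how many start positions each strategy
attempts.

```cpp
#include <cstddef>
#include <string>

// Hypothetical result type for the toy model.
struct SearchResult {
  bool matched;
  std::size_t attempts;  // number of start positions tried
};

// Returns true if `needle` occurs in `input` starting at offset `pos`.
static bool literal_at(const std::string& input, std::size_t pos,
                       const std::string& needle)
{
  return input.compare(pos, needle.size(), needle) == 0;
}

// Models the behaviour described in this report: every start position is
// attempted, although the anchored pattern can only succeed at offset 0.
static SearchResult search_every_position(const std::string& input,
                                          const std::string& needle)
{
  SearchResult r{false, 0};
  for (std::size_t pos = 0; pos <= input.size(); ++pos) {
    ++r.attempts;
    if (pos == 0 && literal_at(input, pos, needle)) {
      r.matched = true;
      break;
    }
  }
  return r;
}

// Models the hoped-for behaviour: one attempt at offset 0, then stop.
static SearchResult search_fail_fast(const std::string& input,
                                     const std::string& needle)
{
  SearchResult r{false, 1};
  r.matched = literal_at(input, 0, needle);
  return r;
}
```

For 100 '\0' characters and the literal "what", the first function makes 101
attempts and the second makes 1; both report no match.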

It is my opinion that this is a quality-of-implementation issue, and one that
appears only in C++ implementations of regular expressions. This problem is
common to libstdc++, libc++ and also Visual Studio's standard library
implementation. (I am raising bugs against all three.)

As a workaround I'm having to artificially select a prefix of the input data in
order to get a fast result. In the example above, that could be:

  auto begin = std::cbegin(buf), end = std::cbegin(buf)+4;

However, not all examples are so trivial (indeed, the example above would be
much better approached with a simple string prefix comparison) and the
workaround is not always so easy. When the pattern is more complex, it is not
always easy to find the best number of characters to send to the regex engine,
and the resulting code is not particularly elegant. It would be much better if
the engine could be given the whole input without having to worry about scale.
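
One alternative that avoids choosing a prefix length might be the standard
std::regex_constants::match_continuous flag, which restricts a match to
sub-sequences that begin at the first character. The wrapper name
search_at_start below is a hypothetical helper, and whether a given
implementation actually returns in constant time with this flag is an
assumption that would need to be measured.

```cpp
#include <regex>

// Hypothetical helper: like std::regex_search, but with match_continuous set,
// so only a match that begins at `first` is accepted. This expresses the
// intent of a leading "^" through a match flag instead of a shortened input.
static bool search_at_start(const char* first, const char* last,
                            std::cmatch& groups, const std::regex& re)
{
  return std::regex_search(first, last, groups, re,
                           std::regex_constants::match_continuous);
}
```

This keeps the whole input available to the engine (so longer patterns do not
need a hand-picked prefix length), but it only suits patterns that are meant
to match at the very start of the input.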

Hopefully my expectation isn't unreasonable; Safari's implementation of regex
behaves as I'd expect. That is, the time to return a "no match" result is
constant (and fast) given the JS equivalent of the above example.

Is it possible that the regex_search implementation could be given a little
more intelligence?

(Apologies that I am not sufficiently familiar with libc++ version history to
select an appropriate version number for this bug.)</pre>

        </div>

      </p>


    </body>

</html>