[llvm-bugs] [Bug 40390] New: regex_match doesn't fail early when given a non-matching pattern with a start-of-input anchor

Mon Jan 21 05:03:17 PST 2019

https://bugs.llvm.org/show_bug.cgi?id=40390

            Bug ID: 40390
           Summary: regex_match doesn't fail early when given a
                    non-matching pattern with a start-of-input anchor
           Product: libc++
           Version: unspecified
          Hardware: PC
                OS: All
            Status: NEW
          Severity: normal
          Priority: P
         Component: All Bugs
          Assignee: unassignedclangbugs at nondot.org
          Reporter: tom at kera.name
                CC: llvm-bugs at lists.llvm.org, mclow.lists at gmail.com

I first raised this on SO (https://stackoverflow.com/q/54237547/560648), on
which I have posted some benchmarks to back up the claim(s) below.

Take the following:

#include <regex>
int main()
{
  static const size_t BufSize = 100;
  char buf[BufSize] = {};
  auto begin = std::cbegin(buf), end = std::cend(buf);

  std::cmatch groups;
  std::regex::flag_type flags = std::regex_constants::ECMAScript;
  std::regex re("^what", flags);
  std::regex_search(begin, end, groups, re);
}

This attempts to match the pattern "^what" against a block of 100 characters.
The match is not expected to succeed (in this case, the input is simply 100
'\0's, but the problem exists for any non-matching input).

However, I expect the match to fail as soon as the first character of input is
examined. By adjusting BufSize to increasingly large values, we observe that
the execution time increases also, suggesting that the regex engine is
examining the entire input even though the presence of the anchor "^"
guarantees that a match will never be found. It only needed to examine the
first character to know this. When BufSize reaches larger values like 100KB,
this becomes quite problematic.

It is clear from the implementation code
(https://github.com/llvm-mirror/libcxx/blob/master/include/regex#L5859-L5897)
that there is simply no logic in place to "fail fast" or "fail early" in a case
like this: the only way a "no match" result is returned is after examining the
whole input, regardless of the pattern.

It is my opinion that this is a quality of implementation issue, and one that
only appears in C++ implementations of regular expressions. This problem is
common to libstdc++, libc++ and also Visual Studio's stdlib impl. (I am raising
bugs against all three.)

As a workaround I'm having to artificially select a prefix of the input data in
order to get a fast result -- in the example above, that could be:

  auto begin = std::cbegin(buf), end = std::cbegin(buf)+4;

However, not all examples are so trivial (indeed, the example above would be
much better approached with a simple string prefix comparison) and the
workaround not always so easy. When the pattern is more complex, it is not
always easy to find the best number of characters to send to the regex engine,
and the resulting code not particularly elegant. It would be much better if the
engine could be given the whole input without having to worry about scale.

Hopefully my expectation isn't unreasonable; Safari's implementation of regex
behaves as I'd expect. That is, the time to return a "no match" result is
constant (and fast) given the JS equivalent of the above example.

Is it possible that the regex_match implementation could be given a little more
intelligence?

(Apologies that I am not sufficiently familiar with libc++ version history to
select an appropriate version number for this bug.)

-- 
You are receiving this mail because:
You are on the CC list for the bug.
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.llvm.org/pipermail/llvm-bugs/attachments/20190121/c347c3cd/attachment.html>