[llvm-bugs] [Bug 40904] New: regex_search on MacOS gives wrong results when \D found in a character class

via llvm-bugs llvm-bugs at lists.llvm.org
Thu Feb 28 08:33:12 PST 2019


https://bugs.llvm.org/show_bug.cgi?id=40904

            Bug ID: 40904
           Summary: regex_search on MacOS gives wrong results when \D
                    found in a character class
           Product: libc++
           Version: unspecified
          Hardware: Macintosh
                OS: All
            Status: NEW
          Severity: normal
          Priority: P
         Component: All Bugs
          Assignee: unassignedclangbugs at nondot.org
          Reporter: tom at kera.name
                CC: llvm-bugs at lists.llvm.org, mclow.lists at gmail.com

Pre-C++20, there's no way to turn on /s (the "dot matches newline" flag), so
instead of a pattern like /ab.cd/ (where the third character could be a
newline) we must write something like /ab[\d\D]cd/ (using the union of
"digits" and "non-digits" to match "any character").

Unfortunately, libc++ doesn't match properly on this.
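
For context, here is a minimal sketch of why the plain dot is not enough, plus one possible alternative that avoids \D inside a character class. The helper names (found, dot_matches_newline, alternation_matches_newline) are hypothetical and only for illustration, and the alternation pattern is an assumption on my part: I have not checked it against the affected libc++ build.

```cpp
#include <regex>
#include <string>

// Hypothetical helper: true if `pattern` (ECMAScript grammar) is found
// somewhere in `input`.
bool found(const std::string& pattern, const std::string& input)
{
    const std::regex re(pattern, std::regex_constants::ECMAScript);
    return std::regex_search(input, re);
}

// Under the ECMAScript grammar '.' does not match line terminators,
// so /^ab.cd/ is not expected to match "ab\ncd".
bool dot_matches_newline()
{
    return found("^ab.cd", "ab\ncd");
}

// Assumed alternative: '.' OR an explicit CR/LF. The "\r\n" below are
// real control characters placed in the bracket by the C++ string
// literal, so no regex escape (and no \D) is involved.
bool alternation_matches_newline()
{
    return found("^ab(?:.|[\r\n])cd", "ab\ncd");
}
```

If this holds up, the alternation form could stand in for [\d\D] where a newline must be accepted, at the cost of a less readable pattern.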

Example:

  #include <regex>
  #include <string>
  #include <iostream>
  #include <iomanip>

  int main()
  {
      const std::string input = "abZcd";
      char const* pattern = R"REGEX(^ab[\d\D]cd)REGEX";

      std::regex::flag_type flags = std::regex_constants::ECMAScript;
      std::regex re(pattern, flags);

      std::cout << std::boolalpha
                << std::regex_search(input.cbegin(), input.cend(), re) << '\n';
  }

Output is "false" with:

  $ clang --version
  Apple LLVM version 10.0.0 (clang-1000.10.44.4)
  Target: x86_64-apple-darwin18.2.0
  Thread model: posix
  InstalledDir: /Library/Developer/CommandLineTools/usr/bin

But "true" (as expected) with g++ (GCC) 8.2.0.

Looking into it a bit, here are the results with some variants:

Pattern        Input    Should match?    Matches?
-------------------------------------------------
/^ab[\d\D]cd/  abZcd        Yes             No      <--- !
/^ab[\d\D]cd/  ab5cd        Yes             No      <--- !
/^ab[\D]cd/    abZcd        Yes             No      <--- !
/^ab\Dcd/      abZcd        Yes             Yes
/^ab[\d]cd/    ab5cd        Yes             Yes
/^ab\dcd/      ab5cd        Yes             Yes
/^ab\dcd/      abZcd        No              No
/^ab\Dcd/      ab5cd        No              No

The common feature amongst the three failures is the \D inside a character
class.

The behaviour is the same when switching to std::regex_match.
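
The variants in the table can be checked mechanically with something like the sketch below. The names (Variant, variants, search_result, match_result, count_mismatches) are made up for illustration; the patterns, inputs and expected values are copied from the table above. This is only a rough harness, not the exact code used to produce those results.

```cpp
#include <iostream>
#include <regex>
#include <string>
#include <vector>

// One row of the table: pattern, input, and whether it should match.
struct Variant {
    std::string pattern;
    std::string input;
    bool expected;
};

// The eight rows, in the same order as the table.
const std::vector<Variant>& variants()
{
    static const std::vector<Variant> rows = {
        {R"(^ab[\d\D]cd)", "abZcd", true},
        {R"(^ab[\d\D]cd)", "ab5cd", true},
        {R"(^ab[\D]cd)",   "abZcd", true},
        {R"(^ab\Dcd)",     "abZcd", true},
        {R"(^ab[\d]cd)",   "ab5cd", true},
        {R"(^ab\dcd)",     "ab5cd", true},
        {R"(^ab\dcd)",     "abZcd", false},
        {R"(^ab\Dcd)",     "ab5cd", false},
    };
    return rows;
}

// Result of std::regex_search for one row.
bool search_result(const Variant& v)
{
    const std::regex re(v.pattern, std::regex_constants::ECMAScript);
    return std::regex_search(v.input.cbegin(), v.input.cend(), re);
}

// Result of std::regex_match for the same row.
bool match_result(const Variant& v)
{
    const std::regex re(v.pattern, std::regex_constants::ECMAScript);
    return std::regex_match(v.input.cbegin(), v.input.cend(), re);
}

// Prints one line per row and returns how many rows disagree with the
// "Should match?" column when using std::regex_search.
int count_mismatches(std::ostream& out)
{
    int mismatches = 0;
    for (const Variant& v : variants()) {
        const bool got = search_result(v);
        if (got != v.expected) ++mismatches;
        out << std::boolalpha << v.pattern << "  " << v.input
            << "  expected=" << v.expected << "  got=" << got << '\n';
    }
    return mismatches;
}
```

Calling count_mismatches(std::cout) from main() prints one line per variant. A non-zero return value means at least one row disagrees with the expected column, which is what the first three rows do on the affected setup.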

For added fun, I get the expected results on Linux:

  $ clang++ --version
  clang version 5.0.0-3~16.04.1 (tags/RELEASE_500/final)
  Target: x86_64-pc-linux-gnu
  Thread model: posix
  InstalledDir: /usr/bin

Related to bug 21363 (locale fun)?
