<html>
    <head>
      <base href="https://bugs.llvm.org/">
    </head>
    <body><table border="1" cellspacing="0" cellpadding="8">
        <tr>
          <th>Bug ID</th>
          <td><a class="bz_bug_link 
          bz_status_NEW "
   title="NEW - regex_search on MacOS gives wrong results when \D found in a character class"
   href="https://bugs.llvm.org/show_bug.cgi?id=40904">40904</a>
          </td>
        </tr>

        <tr>
          <th>Summary</th>
          <td>regex_search on MacOS gives wrong results when \D found in a character class
          </td>
        </tr>

        <tr>
          <th>Product</th>
          <td>libc++
          </td>
        </tr>

        <tr>
          <th>Version</th>
          <td>unspecified
          </td>
        </tr>

        <tr>
          <th>Hardware</th>
          <td>Macintosh
          </td>
        </tr>

        <tr>
          <th>OS</th>
          <td>All
          </td>
        </tr>

        <tr>
          <th>Status</th>
          <td>NEW
          </td>
        </tr>

        <tr>
          <th>Severity</th>
          <td>normal
          </td>
        </tr>

        <tr>
          <th>Priority</th>
          <td>P
          </td>
        </tr>

        <tr>
          <th>Component</th>
          <td>All Bugs
          </td>
        </tr>

        <tr>
          <th>Assignee</th>
          <td>unassignedclangbugs@nondot.org
          </td>
        </tr>

        <tr>
          <th>Reporter</th>
          <td>tom@kera.name
          </td>
        </tr>

        <tr>
          <th>CC</th>
          <td>llvm-bugs@lists.llvm.org, mclow.lists@gmail.com
          </td>
        </tr></table>
      <p>
        <div>
        <pre>Pre-C++20, there's no way to turn on /s, so instead of a pattern like /ab.cd/
(where the third character could be a newline) we must write something like
/ab[\d\D]cd/ (using the union of "digits" and "non-digits" to match "any
character").

Unfortunately, libc++ does not match this pattern correctly.

Example:

  #include <regex>
  #include <string>
  #include <iostream>
  #include <iomanip>

  int main()
  {
      const std::string input = "abZcd";
      char const* pattern = R"REGEX(^ab[\d\D]cd)REGEX";

      std::regex::flag_type flags = std::regex_constants::ECMAScript;
      std::regex re(pattern, flags);

      std::cout << std::boolalpha
                << std::regex_search(input.cbegin(), input.cend(), re) << '\n';
  }

Output is "false" with:

  $ clang --version
  Apple LLVM version 10.0.0 (clang-1000.10.44.4)
  Target: x86_64-apple-darwin18.2.0
  Thread model: posix
  InstalledDir: /Library/Developer/CommandLineTools/usr/bin

But "true" (as expected) with g++ (GCC) 8.2.0.

Looking into it a bit, here are the results with some variants:

Pattern        Input    Should match?    Matches?
-------------------------------------------------
/^ab[\d\D]cd/  abZcd        Yes             No      <--- !
/^ab[\d\D]cd/  ab5cd        Yes             No      <--- !
/^ab[\D]cd/    abZcd        Yes             No      <--- !
/^ab\Dcd/      abZcd        Yes             Yes
/^ab[\d]cd/    ab5cd        Yes             Yes
/^ab\dcd/      ab5cd        Yes             Yes
/^ab\dcd/      abZcd        No              No
/^ab\Dcd/      ab5cd        No              No

The common feature amongst the three failures is the \D inside a character
class.

The behaviour is the same when switching to std::regex_match.

For added fun, I get the expected results on Linux:

  $ clang++ --version
  clang version 5.0.0-3~16.04.1 (tags/RELEASE_500/final)
  Target: x86_64-pc-linux-gnu
  Thread model: posix
  InstalledDir: /usr/bin

Related to <a class="bz_bug_link 
          bz_status_NEW "
   title="NEW - Investigate and fix failing regex tests on linux."
   href="show_bug.cgi?id=21363">bug 21363</a> (locale fun)?</pre>
        </div>
      </p>


    </body>
</html>