<html>

    <head>

      <base href="https://bugs.llvm.org/">

    </head>

    <body><table border="1" cellspacing="0" cellpadding="8">

        <tr>

          <th>Bug ID</th>

          <td><a class="bz_bug_link 

          bz_status_NEW "

   title="NEW - directory_iterator assert crash on Unicode input on Windows"

   href="https://bugs.llvm.org/show_bug.cgi?id=46236">46236</a>

          </td>

        </tr>

        <tr>

          <th>Summary</th>

          <td>directory_iterator assert crash on Unicode input on Windows

          </td>

        </tr>

        <tr>

          <th>Product</th>

          <td>libraries

          </td>

        </tr>

        <tr>

          <th>Version</th>

          <td>trunk

          </td>

        </tr>

        <tr>

          <th>Hardware</th>

          <td>PC

          </td>

        </tr>

        <tr>

          <th>OS</th>

          <td>Windows 2000

          </td>

        </tr>

        <tr>

          <th>Status</th>

          <td>NEW

          </td>

        </tr>

        <tr>

          <th>Severity</th>

          <td>normal

          </td>

        </tr>

        <tr>

          <th>Priority</th>

          <td>P

          </td>

        </tr>

        <tr>

          <th>Component</th>

          <td>Support Libraries

          </td>

        </tr>

        <tr>

          <th>Assignee</th>

          <td>unassignedbugs@nondot.org

          </td>

        </tr>

        <tr>

          <th>Reporter</th>

          <td>andrey@futoin.org

          </td>

        </tr>

        <tr>

          <th>CC</th>

          <td>llvm-bugs@lists.llvm.org

          </td>

        </tr></table>

      <p>

        <div>

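<pre>In short: the UTF-8 byte length of the path (Path.size()) is used to index the
UTF-16 buffer (PathUTF16), which leads to an out-of-range assertion.

Why: the UTF-8 encoding of a path may be longer in bytes than its UTF-16
encoding is in wchar_t units, so Path.size() - 1 can point past the end of
PathUTF16.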
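
To make the length mismatch concrete, here is a minimal, portable sketch. It is
not LLVM code: utf16Units is a hypothetical helper, and it assumes the input is
valid UTF-8. It only counts how many UTF-16 code units the same text needs.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper (not part of LLVM): counts the UTF-16 code units
// needed to encode a UTF-8 string. Assumes the input is valid UTF-8.
std::size_t utf16Units(const std::string &Utf8) {
  std::size_t Units = 0;
  for (unsigned char C : Utf8) {
    // Continuation bytes (10xxxxxx) belong to a code point already counted.
    if ((C & 0xC0) == 0x80)
      continue;
    // Lead bytes of 4-byte sequences (0xF0 and above) need a surrogate pair.
    Units += (C >= 0xF0) ? 2 : 1;
  }
  return Units;
}
```

For example, the path C:\папка is 13 bytes in UTF-8 but only 8 units in UTF-16,
so Path.size() - 1 is 12 while the last valid index of the UTF-16 buffer is 7.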

Severity: unlikely to be critical, but I spotted the problem in third-party
software. It seems that only a few LLVM/clang internals use this
functionality.

A short, obvious bug fix that shows the problem. It was written blind and has
not been tested:

diff --git a/llvm/lib/Support/Windows/Path.inc b/llvm/lib/Support/Windows/Path.inc
index ec62e656ddf..49fc8dbdfb0 100644
--- a/llvm/lib/Support/Windows/Path.inc
+++ b/llvm/lib/Support/Windows/Path.inc
@@ -941,32 +941,32 @@ static basic_file_status status_from_find_data(WIN32_FIND_DATAW *FindData) {
                            FindData->ftLastWriteTime.dwHighDateTime,
                            FindData->ftLastWriteTime.dwLowDateTime,
                            FindData->nFileSizeHigh, FindData->nFileSizeLow);
 }
 std::error_code detail::directory_iterator_construct(detail::DirIterState &IT,
                                                      StringRef Path,
                                                      bool FollowSymlinks) {
   SmallVector<wchar_t, 128> PathUTF16;
   if (std::error_code EC = widenPath(Path, PathUTF16))
     return EC;
   // Convert path to the format that Windows is happy with.
   if (PathUTF16.size() > 0 &&
-      !is_separator(PathUTF16[Path.size() - 1]) &&
-      PathUTF16[Path.size() - 1] != L':') {
+      !is_separator(PathUTF16[PathUTF16.size() - 1]) &&
+      PathUTF16[PathUTF16.size() - 1] != L':') {
     PathUTF16.push_back(L'\\');
     PathUTF16.push_back(L'*');
   } else {
     PathUTF16.push_back(L'*');
   }
   //  Get the first directory entry.
   WIN32_FIND_DATAW FirstFind;
   ScopedFindHandle FindHandle(::FindFirstFileExW(
       c_str(PathUTF16), FindExInfoBasic, &FirstFind, FindExSearchNameMatch,
       NULL, FIND_FIRST_EX_LARGE_FETCH));
   if (!FindHandle)
     return mapWindowsError(::GetLastError());
   size_t FilenameLen = ::wcslen(FirstFind.cFileName);</pre>
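        <pre>The corrected check from the diff above can be sketched outside LLVM as
follows. This is an illustration only: appendSearchWildcard and isSeparator are
hypothetical names, std::wstring stands in for SmallVector<wchar_t, 128>, and
the separator test is assumed to accept both backslash and forward slash.

```cpp
#include <string>

// Hypothetical stand-in for LLVM's is_separator on Windows-style paths.
bool isSeparator(wchar_t C) { return C == L'\\' || C == L'/'; }

// Appends the search wildcard the way the patched code does, inspecting the
// last element of the UTF-16 buffer itself instead of indexing it with the
// UTF-8 byte length.
std::wstring appendSearchWildcard(std::wstring PathUTF16) {
  if (!PathUTF16.empty() && !isSeparator(PathUTF16.back()) &&
      PathUTF16.back() != L':')
    PathUTF16 += L"\\*"; // no trailing separator or drive colon: add "\*"
  else
    PathUTF16 += L'*';   // empty, or already ends in a separator or ':'
  return PathUTF16;
}
```

Using back() rather than an explicit index keeps the check tied to the buffer
being inspected, so the two lengths cannot be mixed up.</pre>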

        </div>

      </p>


    </body>

</html>