[llvm] 02a3754 - [Windows] Avoid using FileIndex for unique IDs

Martin Storsjö via llvm-commits llvm-commits at lists.llvm.org
Tue Sep 12 12:12:38 PDT 2023


Author: Martin Storsjö
Date: 2023-09-12T22:11:42+03:00
New Revision: 02a37547834902ed05fa9c5d1dcc9e31c37e2182

URL: https://github.com/llvm/llvm-project/commit/02a37547834902ed05fa9c5d1dcc9e31c37e2182
DIFF: https://github.com/llvm/llvm-project/commit/02a37547834902ed05fa9c5d1dcc9e31c37e2182.diff

LOG: [Windows] Avoid using FileIndex for unique IDs

The FileIndex values returned from GetFileInformationByHandle are
considered stable and uniquely identifying a file, as long as the
handle is open. When handles are closed, there are no guarantees
for their stability or uniqueness. On some file systems (such as
NTFS), the indices are documented to be stable even across handles.
But with some file systems, in particular network mounts, file
indices can be reused very soon after handles are closed.

When such file indices are used for LLVM's UniqueID, files are
considered duplicates as soon as the filesystem driver happens to
have used the same file index for the handle used to inspect the
file. This caused widespread, non-obvious (seemingly random)
breakage. This can happen e.g. if running on a directory that is
shared via Remote Desktop or VirtualBox.

To avoid the issue, use a hash of the canonicalized path for the
file as unique identifier, instead of using FileIndex.

This fixes https://github.com/llvm/llvm-project/issues/61401 and
https://github.com/llvm/llvm-project/issues/22079.

Performance wise, this adds (usually) one extra call to
GetFinalPathNameByHandleW for each call to getStatus(). A test
cases such as running clang-scan-deps becomes around 1% slower
by this, which is considered tolerable.

Change the equivalent() function to use getUniqueID instead of
checking individual file_status fields. The
equivalent(Twine,Twine,bool& result) function calls status() on
each path successively, without keeping the file handles open,
which also is prone to such false positives. This also gets rid
of checks of other superfluous fields in the
equivalent(file_status, file_status) function - the unique ID of
a file should be enough (that is what is done for Unix anyway).

This comes with one known caveat: For hardlinks, each name for
the file now gets a different UniqueID, and equivalent() considers
them different. While that's not ideal, occasional false negatives
for equivalent() is usually that fatal (the cases where we strictly
do need to deduplicate files with different path names are quite
rare) compared to the issues caused by false positives for
equivalent() (where we'd deduplicate and omit totally distinct files).

The FileIndex is documented to be stable on NTFS though, so ideally
we could maybe have used it in the majority of cases. That would
require a heuristic for whether we can rely on FileIndex or not.
We considered using the existing function is_local_internal for that;
however that caused an unacceptable performance regression
(clang-scan-deps became 38% slower in one test, even more than that
in another test).

Differential Revision: https://reviews.llvm.org/D155579

Added: 
    

Modified: 
    llvm/docs/ReleaseNotes.rst
    llvm/include/llvm/Support/FileSystem.h
    llvm/lib/Support/Windows/Path.inc
    llvm/unittests/Support/Path.cpp

Removed: 
    


################################################################################
diff  --git a/llvm/docs/ReleaseNotes.rst b/llvm/docs/ReleaseNotes.rst
index 533c93e62e5e116..5a7b2914d43c31c 100644
--- a/llvm/docs/ReleaseNotes.rst
+++ b/llvm/docs/ReleaseNotes.rst
@@ -117,6 +117,14 @@ Changes to the WebAssembly Backend
 Changes to the Windows Target
 -----------------------------
 
+* The LLVM filesystem class ``UniqueID`` and function ``equivalent()``
+  no longer determine that distinct 
diff erent path names for the same
+  hard linked file actually are equal. This is an intentional tradeoff in a
+  bug fix, where the bug used to cause distinct files to be considered
+  equivalent on some file systems. This change fixed the issues
+  https://github.com/llvm/llvm-project/issues/61401 and
+  https://github.com/llvm/llvm-project/issues/22079.
+
 Changes to the X86 Backend
 --------------------------
 

diff  --git a/llvm/include/llvm/Support/FileSystem.h b/llvm/include/llvm/Support/FileSystem.h
index 905ae961ec834c9..0a9c576027f96e1 100644
--- a/llvm/include/llvm/Support/FileSystem.h
+++ b/llvm/include/llvm/Support/FileSystem.h
@@ -233,8 +233,7 @@ class file_status : public basic_file_status {
   #elif defined (_WIN32)
   uint32_t NumLinks = 0;
   uint32_t VolumeSerialNumber = 0;
-  uint32_t FileIndexHigh = 0;
-  uint32_t FileIndexLow = 0;
+  uint64_t PathHash = 0;
   #endif
 
 public:
@@ -255,13 +254,12 @@ class file_status : public basic_file_status {
               uint32_t LastAccessTimeHigh, uint32_t LastAccessTimeLow,
               uint32_t LastWriteTimeHigh, uint32_t LastWriteTimeLow,
               uint32_t VolumeSerialNumber, uint32_t FileSizeHigh,
-              uint32_t FileSizeLow, uint32_t FileIndexHigh,
-              uint32_t FileIndexLow)
+              uint32_t FileSizeLow, uint64_t PathHash)
       : basic_file_status(Type, Perms, LastAccessTimeHigh, LastAccessTimeLow,
                           LastWriteTimeHigh, LastWriteTimeLow, FileSizeHigh,
                           FileSizeLow),
         NumLinks(LinkCount), VolumeSerialNumber(VolumeSerialNumber),
-        FileIndexHigh(FileIndexHigh), FileIndexLow(FileIndexLow) {}
+        PathHash(PathHash) {}
   #endif
 
   UniqueID getUniqueID() const;

diff  --git a/llvm/lib/Support/Windows/Path.inc b/llvm/lib/Support/Windows/Path.inc
index b949b724509f65b..744b6ff80d00384 100644
--- a/llvm/lib/Support/Windows/Path.inc
+++ b/llvm/lib/Support/Windows/Path.inc
@@ -158,12 +158,7 @@ std::string getMainExecutable(const char *argv0, void *MainExecAddr) {
 }
 
 UniqueID file_status::getUniqueID() const {
-  // The file is uniquely identified by the volume serial number along
-  // with the 64-bit file identifier.
-  uint64_t FileID = (static_cast<uint64_t>(FileIndexHigh) << 32ULL) |
-                    static_cast<uint64_t>(FileIndexLow);
-
-  return UniqueID(VolumeSerialNumber, FileID);
+  return UniqueID(VolumeSerialNumber, PathHash);
 }
 
 ErrorOr<space_info> disk_space(const Twine &Path) {
@@ -362,16 +357,17 @@ std::error_code is_local(const Twine &path, bool &result) {
 }
 
 static std::error_code realPathFromHandle(HANDLE H,
-                                          SmallVectorImpl<wchar_t> &Buffer) {
+                                          SmallVectorImpl<wchar_t> &Buffer,
+                                          DWORD flags = VOLUME_NAME_DOS) {
   Buffer.resize_for_overwrite(Buffer.capacity());
   DWORD CountChars = ::GetFinalPathNameByHandleW(
-      H, Buffer.begin(), Buffer.capacity(), FILE_NAME_NORMALIZED);
+      H, Buffer.begin(), Buffer.capacity(), FILE_NAME_NORMALIZED | flags);
   if (CountChars && CountChars >= Buffer.capacity()) {
     // The buffer wasn't big enough, try again.  In this case the return value
     // *does* indicate the size of the null terminator.
     Buffer.resize_for_overwrite(CountChars);
     CountChars = ::GetFinalPathNameByHandleW(H, Buffer.begin(), Buffer.size(),
-                                             FILE_NAME_NORMALIZED);
+                                             FILE_NAME_NORMALIZED | flags);
   }
   Buffer.truncate(CountChars);
   if (CountChars == 0)
@@ -647,12 +643,7 @@ bool can_execute(const Twine &Path) {
 
 bool equivalent(file_status A, file_status B) {
   assert(status_known(A) && status_known(B));
-  return A.FileIndexHigh == B.FileIndexHigh &&
-         A.FileIndexLow == B.FileIndexLow && A.FileSizeHigh == B.FileSizeHigh &&
-         A.FileSizeLow == B.FileSizeLow &&
-         A.LastWriteTimeHigh == B.LastWriteTimeHigh &&
-         A.LastWriteTimeLow == B.LastWriteTimeLow &&
-         A.VolumeSerialNumber == B.VolumeSerialNumber;
+  return A.getUniqueID() == B.getUniqueID();
 }
 
 std::error_code equivalent(const Twine &A, const Twine &B, bool &result) {
@@ -698,6 +689,7 @@ static perms perms_from_attrs(DWORD Attrs) {
 }
 
 static std::error_code getStatus(HANDLE FileHandle, file_status &Result) {
+  SmallVector<wchar_t, MAX_PATH> ntPath;
   if (FileHandle == INVALID_HANDLE_VALUE)
     goto handle_status_error;
 
@@ -725,13 +717,37 @@ static std::error_code getStatus(HANDLE FileHandle, file_status &Result) {
   if (!::GetFileInformationByHandle(FileHandle, &Info))
     goto handle_status_error;
 
+  // File indices aren't necessarily stable after closing the file handle;
+  // instead hash a canonicalized path.
+  //
+  // For getting a canonical path to the file, call GetFinalPathNameByHandleW
+  // with VOLUME_NAME_NT. We don't really care exactly what the path looks
+  // like here, as long as it is canonical (e.g. doesn't 
diff erentiate between
+  // whether a file was referred to with upper/lower case names originally).
+  // The default format with VOLUME_NAME_DOS doesn't work with all file system
+  // drivers, such as ImDisk. (See
+  // https://github.com/rust-lang/rust/pull/86447.)
+  uint64_t PathHash;
+  if (std::error_code EC =
+          realPathFromHandle(FileHandle, ntPath, VOLUME_NAME_NT)) {
+    // If realPathFromHandle failed, fall back on the fields
+    // nFileIndex{High,Low} instead. They're not necessarily stable on all file
+    // systems as they're only documented as being unique/stable as long as the
+    // file handle is open - but they're a decent fallback if we couldn't get
+    // the canonical path.
+    PathHash = (static_cast<uint64_t>(Info.nFileIndexHigh) << 32ULL) |
+               static_cast<uint64_t>(Info.nFileIndexLow);
+  } else {
+    PathHash = hash_combine_range(ntPath.begin(), ntPath.end());
+  }
+
   Result = file_status(
       file_type_from_attrs(Info.dwFileAttributes),
       perms_from_attrs(Info.dwFileAttributes), Info.nNumberOfLinks,
       Info.ftLastAccessTime.dwHighDateTime, Info.ftLastAccessTime.dwLowDateTime,
       Info.ftLastWriteTime.dwHighDateTime, Info.ftLastWriteTime.dwLowDateTime,
       Info.dwVolumeSerialNumber, Info.nFileSizeHigh, Info.nFileSizeLow,
-      Info.nFileIndexHigh, Info.nFileIndexLow);
+      PathHash);
   return std::error_code();
 
 handle_status_error:

diff  --git a/llvm/unittests/Support/Path.cpp b/llvm/unittests/Support/Path.cpp
index addfb45b4a92658..35a01aa27667933 100644
--- a/llvm/unittests/Support/Path.cpp
+++ b/llvm/unittests/Support/Path.cpp
@@ -708,12 +708,16 @@ TEST_F(FileSystemTest, Unique) {
 
   ASSERT_NO_ERROR(fs::remove(Twine(TempPath2)));
 
+#ifndef _WIN32
   // Two paths representing the same file on disk should still provide the
   // same unique id.  We can test this by making a hard link.
+  // FIXME: Our implementation of getUniqueID on Windows doesn't consider hard
+  // links to be the same file.
   ASSERT_NO_ERROR(fs::create_link(Twine(TempPath), Twine(TempPath2)));
   fs::UniqueID D2;
   ASSERT_NO_ERROR(fs::getUniqueID(Twine(TempPath2), D2));
   ASSERT_EQ(D2, F1);
+#endif
 
   ::close(FileDescriptor);
 
@@ -909,12 +913,16 @@ TEST_F(FileSystemTest, TempFiles) {
 
   // Create a hard link to Temp1.
   ASSERT_NO_ERROR(fs::create_link(Twine(TempPath), Twine(TempPath2)));
+#ifndef _WIN32
+  // FIXME: Our implementation of equivalent() on Windows doesn't consider hard
+  // links to be the same file.
   bool equal;
   ASSERT_NO_ERROR(fs::equivalent(Twine(TempPath), Twine(TempPath2), equal));
   EXPECT_TRUE(equal);
   ASSERT_NO_ERROR(fs::status(Twine(TempPath), A));
   ASSERT_NO_ERROR(fs::status(Twine(TempPath2), B));
   EXPECT_TRUE(fs::equivalent(A, B));
+#endif
 
   // Remove Temp1.
   ::close(FileDescriptor);


        


More information about the llvm-commits mailing list