[cfe-dev] [libc++][RFC] Implementing Directory Entry Caching for Filesystem

Davis Herring via cfe-dev cfe-dev at lists.llvm.org
Wed Jul 18 10:49:53 PDT 2018


> As I'll explain below, each different approach can cause non-obvious
> behavior, possibly leading to TOCTOU bugs and security
> vulnerabilities!
Anyone who has real TOCTOU concerns (like operating in directories 
controlled by other users) has to do special things anyway: in C++17, 
they can current_path("foo/bar") to avoid re-resolving foo and bar on 
future lookups, while in POSIX the better answer is to use the *at 
functions to avoid interference between threads and libraries via the 
process-wide shared CWD state.

> Above, a TOCTOU violation may occur if another process acts on the
> file referenced by `ent`, changing the attributes between the point
> the directory_entry was created and the point at which the
> `is_symlink` condition is checked.
But another process can already act between the is_symlink and the call 
to remove.  Absent a userspace dentry-level interface, it's not clear 
what the additional concern is from the is_symlink value being cached.

> However, at least the above example makes it clear *when* the check
> is occuring.

Why does it matter that the location of the check is obvious so long as 
(it's at least as obvious that) there's a race window after it?

> Using the `refresh` function, either called directly by the user, or
> implicitly by the constructors or modifying methods (Ex. `assign`).
> `refresh` has the ability to fully populate the cache

It would certainly be nice to cache all the results from [l]stat(2) 
(C++17 Late NB comment 20).  However, I'm not sure that you can cache 
anything more in refresh() than in directory_iterator::operator++.  The 
normative encouragement in [fs.class.directory_entry]/2 specifically 
says "during directory iteration" without suggesting that other 
attributes could be cached at other times.

Of course, then I'm not sure that even d_type can be cached: 
[fs.dir.entry.obs]/27 says that a file_status is either cached or not, 
and it also contains the permissions (from stat(2)).

>   if (!ent.is_symlink())
>    return false;
> 
>  // We should have the file_status for the symlink cached, right?
>  file_status st = ent.symlink_status();
> 
>  // Nope! Only when `refresh` has been called are the perms cached.
>  assert(is_symlink(st)); // May fail!

The assertion failure is surprising only if the author assumed that the 
implementation was caching anything.  Caching of any particular 
attribute can only increase consistency with regard to it, in that there 
are fewer system calls that might observe updates.  So code that is 
correct without caching will not become more racy when it is introduced.

> bool  is_empty_regular_file(path p) {
>  directory_entry ent(p);   // Calls refresh()
>  return ent.is_regular_file() && ent.file_size() == 0;
> }

It's fine to write code this way in an attempt to capitalize on caching 
as an optimization, but any coherence it adds is non-portable.  There is 
simply no way in C++17 to make this function reliable (thus the NB 
comment).  Of course, there's only so much a race-free implementation 
could get you, given the race window before any subsequent operation.

> Worse yet, the latter case, which fully populates the cache
> non-atomically, potentially commits a TOCTOU violation itself, and
> this violation is hidden from the user, preventing them from knowing
> it exists or being able to do anything about it.
First, note that both this caching and is_empty_regular_file (without 
any caching) can produce "impossible" results:

In a directory that never contains any empty regular files, 
is_empty_regular_file can return true because a regular file is replaced 
with an empty non-regular file (e.g., a device file).

In a directory that never contains a symlink to a directory, a 
non-atomic caching directory_entry can have is_symlink and is_directory 
true simultaneously because a symlink was replaced with a directory (or 
vice versa).

The latter case concerns me a bit more: because the order of the system 
calls might not match the order of the queries, they can return results 
inconsistent with what is known to be a single possibly-concurrent 
change to a file.  The motivated user can use refresh to recover from 
such an impossibility, while in the case of an unbounded number of 
concurrent changes nothing can be guaranteed anyway.

> Afterwards, a better and safer proposal can be put forward  which to 
> provides efficient and atomic access to multiple attributes for the
> same entity can be proposed.
It's hard to standardize atomic access to properties that are not 
atomically accessible everywhere.  But one could more easily entertain 
separate accessors for cached values (which return "unknown" as 
appropriate), so that the decision to use them (where available) falls 
to the user.

Davis

-- 
This product is sold by volume, not by mass.  If it appears too dense or 
too sparse, it is because mass-energy conversion has occurred during 
shipping.



More information about the cfe-dev mailing list