[cfe-dev] [libc++][RFC] Implementing Directory Entry Caching for Filesystem

Eric Fiselier via cfe-dev cfe-dev at lists.llvm.org
Mon Jul 16 19:39:45 PDT 2018


* Hi All,I have a couple of questions and concerns regarding
implementations of P0317R1
<http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2016/p0317r1.html> [1],
which I would like help from the community answering. The paper changes
`directory_entry` to cache a number of attributes, including file status,
 permissions, file size, number of hard links, last write time, and more.
For reference, this is the interface
<https://en.cppreference.com/w/cpp/experimental/fs/directory_entry> of
`directory_entry` before this paper [2], and this is the interface
<https://en.cppreference.com/w/cpp/filesystem/directory_entry> after
[3].The implementation has a lot of latitude over which attributes it
caches, if any, and how it caches them. As I'll explain below, each
different approach can cause non-obvious behavior, possibly leading to
TOCTOU bugs and security vulnerabilities!My question for the community is
this:Given the considerations below, what should libc++ choose to do?1.
Cache nothing?2. Cache as much as possible, when possible?3. Something in
between?I would like to land std::filesystem before the 7.0 branch, and
doing so requires deciding on which implementation to provide now. Note
that this paper only considers POSIX based implementations of
filesystem.TOCTOU Violations============================Time of check to
time of use, or TOCTOU, is a class of software bugs caused by changes in a
system between the time a condition is checked, and the time the result is
used [4]. If libc++ chooses to do any caching in directory_entry, it will
be vulnerable to this. If libc++ chooses to do no caching at all, it will
also be vulnerable to this but in different scenarios.Most importantly,
whichever approach is taken, libc++ will bare some responsibility for the
TOCTOU bugs it allows users to write accidently. So it’s imperative we
consider all approaches carefully.Let’s take a simple example:void
remove_symlinks_in_dir(path p) {  for (directory_entry ent :
directory_iterator(p)) {    if (ent.is_symlink())      remove(ent);
 }}Above, a TOCTOU violation may occur if another process acts on the file
referenced by `ent`, changing the attributes between the point the
directory_entry was created and the point at which the `is_symlink`
condition is checked.At this point it's important to note that <filesystem>
interface is, in general, vulnerable to TOCTOU --- even without
directory_entry caching. For example:void remove_symlinks_in_dir(path p) {
 for (directory_entry ent : directory_iterator(p)) {    if
(is_symlink(ent))      remove(ent);  }}In the above case, a TOCTOU
violation still occurs, since changes to `ent` can take place between the
check and the call to `remove`. However, at least the above example makes
it clear *when* the check is occuring.Simply eliminating caching will not
prevent TOCTOU violations, but, in this particular case, it would make them
harder to write and make it more obvious where they may occur. Having a
cache both extends the period of time between the check and the use, as
well as hiding when the check actually occurs.Perhaps the above concessions
make it worthwhile to support a more efficient interface with caching and
fewer file system accesses?If we do choose to cache some attributes, then,
at minimum, it should be obvious to the user which values are cached and
which are not. The following section will explore the feasibility of
this.Directory Iteration VS Refresh==================================The
paper's intention is to allow attributes returned during directory
iteration to be accessed by the user without additional calls to the
underlying filesystem. The cache is populated two different ways: - During
directory iteration, the cache is populated with only the attributes
returned by directory iteration function. With POSIX `readdir`, only the
file type as returned by `lstat` is available.- Using the `refresh`
function, either called directly by the user, or implicitly by the
constructors or modifying methods (Ex. `assign`). `refresh` has the ability
to fully populate the cache (but not atomically in some cases, see the
symlink considerations below).Note that const accessors may not update the
cache. Therefore the state of the cache depends on whether the
`directory_entry` was created during directory iteration or by the
`directory_entry(path&)` constructor, and whether there have been any
intervening calls to `refresh` or any other modifying member functions. The
following case demonstrates a surprising TOCTOU violation, only when the
`directory_entry` cache was populated during directory iteration, and not
by a call to refresh:// Assume the filesystem supports permissions on
symlinks.bool is_readable_symlink(directory_entry ent) {  if
(!ent.is_symlink())    return false;  // We should have the file_status for
the symlink cached, right?  file_status st = ent.symlink_status();  //
Nope! Only when `refresh` has been called are the perms cached.
 assert(is_symlink(st)); // May fail!  return (st.permissions() &
perms::others_read) != perms::none;}The above example is intended to
demonstrate that the different accessors may return a mix of cached and
uncached values. Here `is_symlink()` is cached, but `symlink_status()` may
not be. If the symlink has been replaced with a regular file, then the
result of this function is potentially dangerous nonsense.So what happens
when the user calls `refresh()` before accessing the attributes? Does this
make it safer?The answer depends on the caching behavior of the
implementation. and whether that behavior provides "atomic" or "non-atomic"
attribute access (as defined in the next section).Atomic VS Non-Atomic
Attribute Access=============================================As previously
described, TOCTOU violations require there to be intervening time between
the check and the use. We'll call any filesystem function, or set of
functions, which don’t allow for TOCTOU bugs "atomic". All other functions
or set of functions are considered "non-atomic".Therefore, a
`directory_entry` object can be said to provide "atomic attribute access"
across all accessors if and only if (A) all of the attributes are cached,
and (B) the cache was populated atomically. A non-caching implementation of
`directory_entry` never provides atomic attribute access.Let’s consider the
consequences using the following example:bool  is_empty_regular_file(path
p) {  directory_entry ent(p);   // Calls refresh()  return
ent.is_regular_file() && ent.file_size() == 0;}In a non-caching
implementation `refresh()` is a nop, and therefore opens the user up to
TOCTOU issues occuring between the call to `is_symlink()` and
`is_regular_file()`.In a fully-caching implementation, which provides
atomic attribute access, no such problem occurs.In this case it seems
preferable for libc++ to provide a fully-caching implementation with
"atomic attribute access". But is this possible? The answer is "sometimes,
but not always" as we'll see below.Problems with Symlinks And
Refresh()===========================================Above we established
that if refresh caches at all, then it should do so "atomically",
retrieving all attributes from the filesystem in a atomic manner. Note that
for symlinks, some of the cached attributes refer to the symlink
(is_symlink() and `symlink_status()), while the rest refer to the linked
entity (is_regular_file(), file_size(), etc).When the directory entry does
not refer to a symlink, a single call to `lstat` provides enough
information to fully populate the cache atomically. However, when the
entity is a symlink, we would need a second call to `stat` to determine the
properties of the linked file.Therefore, "atomic attribute access" is not
possible to guarantee.An implementation may choose to partially populate
the caches, omitting attributes about the linked entity; or it may choose
to fully populate the cache non-atomically using a non-atomic series of
calls to `lstat` and `stat`.The former case obfuscates from the user which
attributes are cached and which are not, opening them up to TOCTOU
violations.Worse yet, the latter case, which fully populates the cache
non-atomically, potentially commits a TOCTOU violation itself, and this
violation is hidden from the user, preventing them from knowing it exists
or being able to do anything about it.Both solutions seem fraught with
problems.Conclusion===============I hope to have shown the pros and cons of
different directory entry caching implementations. Now I need your help
deciding what's best for Libc++!Personally, I think the best solution is to
revert these changes to the standard, since each possible implementation
seems broken in one way or another. Directory entries should not cache, and
directory iterators should not attempt to populate the cache incompletely.
Afterwards, a better and safer proposal can be put forward  which to
provides efficient and atomic access to multiple attributes for the same
entity can be proposed.Unfortunately, it might be to late at this point to
revert the paper, even if the committee generally agrees with these
problems.Thoughts? [1]
http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2016/p0317r1.html
<http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2016/p0317r1.html>[2]
https://en.cppreference.com/w/cpp/experimental/fs/directory_entry
<https://en.cppreference.com/w/cpp/experimental/fs/directory_entry>[3]
https://en.cppreference.com/w/cpp/filesystem/directory_entry
<https://en.cppreference.com/w/cpp/filesystem/directory_entry>[4]
https://en.wikipedia.org/wiki/Time_of_check_to_time_of_use
<https://en.wikipedia.org/wiki/Time_of_check_to_time_of_use> *
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.llvm.org/pipermail/cfe-dev/attachments/20180716/64e15bba/attachment.html>


More information about the cfe-dev mailing list