[cfe-dev] [libc++][RFC] Implementing Directory Entry Caching for Filesystem
Eric Fiselier via cfe-dev
cfe-dev at lists.llvm.org
Tue Jul 17 15:31:32 PDT 2018
On Tue, Jul 17, 2018 at 12:43 PM Eric Fiselier <eric at efcs.ca> wrote:
>
>
> On Tue, Jul 17, 2018 at 5:46 AM Aaron Ballman <aaron at aaronballman.com>
> wrote:
>
>> Thank you for putting this together, I greatly appreciate it!
>>
>> Out of curiosity, is the caching behavior something you expect you'll
>> make configurable for users of libc++ (for instance, a macro to enable
>> or disable caching), or are there issues preventing such an
>> implementation? My feeling is that we're going to have the usual push
>> and pull between security and performance where different users will
>> have different (and equally valid) use cases in mind justifying the
>> need for the performance provided by caching or the security red flags
>> enhanced by not caching. We should obviously pick a good default
>> behavior (and can argue over the definition of "good" in that
>> context), but I think no matter what option we pick, users will want
>> the other option to be available.
>>
>
> I hadn't thought of making it configurable, but I'm happy to do so.
> It'll have to maintain the same ABI, so the cache will simply be unused,
> but providing a configuration option seems feasible.
>
After attempting to implement a configuration option to disable caching, I
don't think it's possible.
Because so much of <filesystem> lives across a library boundary, it's
tricky to make that library
code sensitive to a macro defined later by the user.
That being said, perhaps we could file an issue to add
`directory_entry::clear_cache()` or similar,
so that users can explicitly make their `directory_entry` cacheless?
/Eric
>
>
>>
>> ~Aaron
>>
>> On Mon, Jul 16, 2018 at 10:39 PM, Eric Fiselier via cfe-dev
>> <cfe-dev at lists.llvm.org> wrote:
>> > Hi All,
>> >
>> >
>> > I have a couple of questions and concerns regarding implementations of
>> > P0317R1 [1], which I would like help from the community answering. The
>> paper
>> > changes `directory_entry` to cache a number of attributes, including
>> file
>> > status, permissions, file size, number of hard links, last write time,
>> and
>> > more. For reference, this is the interface of `directory_entry` before
>> this
>> > paper [2], and this is the interface after [3].
>> >
>> >
>> > The implementation has a lot of latitude over which attributes it
>> caches, if
>> > any, and how it caches them. As I'll explain below, each different
>> approach
>> > can cause non-obvious behavior, possibly leading to TOCTOU bugs and
>> security
>> > vulnerabilities!
>> >
>> >
>> > My question for the community is this:
>> >
>> >
>> > Given the considerations below, what should libc++ choose to do?
>> >
>> >
>> > 1. Cache nothing?
>> >
>> > 2. Cache as much as possible, when possible?
>> >
>> > 3. Something in between?
>> >
>> >
>> > I would like to land std::filesystem before the 7.0 branch, and doing so
>> > requires deciding on which implementation to provide now.
>> >
>> >
>> > Note that this paper only considers POSIX based implementations of
>> > filesystem.
>> >
>> > TOCTOU Violations
>> >
>> > ============================
>> >
>> >
>> > Time of check to time of use, or TOCTOU, is a class of software bugs
>> caused
>> > by changes in a system between the time a condition is checked, and the
>> time
>> > the result is used [4]. If libc++ chooses to do any caching in
>> > directory_entry, it will be vulnerable to this. If libc++ chooses to do
>> no
>> > caching at all, it will also be vulnerable to this but in different
>> > scenarios.
>> >
>> >
>> > Most importantly, whichever approach is taken, libc++ will bare some
>> > responsibility for the TOCTOU bugs it allows users to write accidently.
>> So
>> > it’s imperative we consider all approaches carefully.
>> >
>> >
>> > Let’s take a simple example:
>> >
>> >
>> > void remove_symlinks_in_dir(path p) {
>> >
>> > for (directory_entry ent : directory_iterator(p)) {
>> >
>> > if (ent.is_symlink())
>> >
>> > remove(ent);
>> >
>> > }
>> >
>> > }
>> >
>> >
>> > Above, a TOCTOU violation may occur if another process acts on the file
>> > referenced by `ent`, changing the attributes between the point the
>> > directory_entry was created and the point at which the `is_symlink`
>> > condition is checked.
>> >
>> >
>> > At this point it's important to note that <filesystem> interface is, in
>> > general, vulnerable to TOCTOU --- even without directory_entry caching.
>> For
>> > example:
>> >
>> >
>> > void remove_symlinks_in_dir(path p) {
>> >
>> > for (directory_entry ent : directory_iterator(p)) {
>> >
>> > if (is_symlink(ent))
>> >
>> > remove(ent);
>> >
>> > }
>> >
>> > }
>> >
>> >
>> > In the above case, a TOCTOU violation still occurs, since changes to
>> `ent`
>> > can take place between the check and the call to `remove`. However, at
>> least
>> > the above example makes it clear *when* the check is occuring.
>> >
>> >
>> > Simply eliminating caching will not prevent TOCTOU violations, but, in
>> this
>> > particular case, it would make them harder to write and make it more
>> obvious
>> > where they may occur. Having a cache both extends the period of time
>> between
>> > the check and the use, as well as hiding when the check actually occurs.
>> >
>> >
>> > Perhaps the above concessions make it worthwhile to support a more
>> efficient
>> > interface with caching and fewer file system accesses?
>> >
>> >
>> > If we do choose to cache some attributes, then, at minimum, it should be
>> > obvious to the user which values are cached and which are not. The
>> following
>> > section will explore the feasibility of this.
>> >
>> >
>> > Directory Iteration VS Refresh
>> >
>> > ==================================
>> >
>> >
>> > The paper's intention is to allow attributes returned during directory
>> > iteration to be accessed by the user without additional calls to the
>> > underlying filesystem.
>> >
>> >
>> > The cache is populated two different ways:
>> >
>> > During directory iteration, the cache is populated with only the
>> attributes
>> > returned by directory iteration function. With POSIX `readdir`, only the
>> > file type as returned by `lstat` is available.
>> >
>> > Using the `refresh` function, either called directly by the user, or
>> > implicitly by the constructors or modifying methods (Ex. `assign`).
>> > `refresh` has the ability to fully populate the cache (but not
>> atomically in
>> > some cases, see the symlink considerations below).
>> >
>> >
>> > Note that const accessors may not update the cache.
>> >
>> >
>> > Therefore the state of the cache depends on whether the
>> `directory_entry`
>> > was created during directory iteration or by the
>> `directory_entry(path&)`
>> > constructor, and whether there have been any intervening calls to
>> `refresh`
>> > or any other modifying member functions.
>> >
>> >
>> > The following case demonstrates a surprising TOCTOU violation, only
>> when the
>> > `directory_entry` cache was populated during directory iteration, and
>> not by
>> > a call to refresh:
>> >
>> >
>> > // Assume the filesystem supports permissions on symlinks.
>> >
>> > bool is_readable_symlink(directory_entry ent) {
>> >
>> > if (!ent.is_symlink())
>> >
>> > return false;
>> >
>> > // We should have the file_status for the symlink cached, right?
>> >
>> > file_status st = ent.symlink_status();
>> >
>> > // Nope! Only when `refresh` has been called are the perms cached.
>> >
>> > assert(is_symlink(st)); // May fail!
>> >
>> > return (st.permissions() & perms::others_read) != perms::none;
>> >
>> > }
>> >
>> >
>> > The above example is intended to demonstrate that the different
>> accessors
>> > may return a mix of cached and uncached values. Here `is_symlink()` is
>> > cached, but `symlink_status()` may not be.
>> >
>> > If the symlink has been replaced with a regular file, then the result of
>> > this function is potentially dangerous nonsense.
>> >
>> >
>> > So what happens when the user calls `refresh()` before accessing the
>> > attributes? Does this make it safer?
>> >
>> >
>> > The answer depends on the caching behavior of the implementation. and
>> > whether that behavior provides "atomic" or "non-atomic" attribute
>> access (as
>> > defined in the next section).
>> >
>> >
>> > Atomic VS Non-Atomic Attribute Access
>> >
>> > =============================================
>> >
>> >
>> > As previously described, TOCTOU violations require there to be
>> intervening
>> > time between the check and the use. We'll call any filesystem function,
>> or
>> > set of functions, which don’t allow for TOCTOU bugs "atomic". All other
>> > functions or set of functions are considered "non-atomic".
>> >
>> >
>> > Therefore, a `directory_entry` object can be said to provide "atomic
>> > attribute access" across all accessors if and only if (A) all of the
>> > attributes are cached, and (B) the cache was populated atomically. A
>> > non-caching implementation of `directory_entry` never provides atomic
>> > attribute access.
>> >
>> >
>> > Let’s consider the consequences using the following example:
>> >
>> >
>> > bool is_empty_regular_file(path p) {
>> >
>> > directory_entry ent(p); // Calls refresh()
>> >
>> > return ent.is_regular_file() && ent.file_size() == 0;
>> >
>> > }
>> >
>> >
>> > In a non-caching implementation `refresh()` is a nop, and therefore
>> opens
>> > the user up to TOCTOU issues occuring between the call to
>> `is_symlink()` and
>> > `is_regular_file()`.
>> >
>> >
>> > In a fully-caching implementation, which provides atomic attribute
>> access,
>> > no such problem occurs.
>> >
>> >
>> > In this case it seems preferable for libc++ to provide a fully-caching
>> > implementation with "atomic attribute access". But is this possible? The
>> > answer is "sometimes, but not always" as we'll see below.
>> >
>> >
>> > Problems with Symlinks And Refresh()
>> >
>> > ===========================================
>> >
>> >
>> > Above we established that if refresh caches at all, then it should do so
>> > "atomically", retrieving all attributes from the filesystem in a atomic
>> > manner. Note that for symlinks, some of the cached attributes refer to
>> the
>> > symlink (is_symlink() and `symlink_status()), while the rest refer to
>> the
>> > linked entity (is_regular_file(), file_size(), etc).
>> >
>> >
>> > When the directory entry does not refer to a symlink, a single call to
>> > `lstat` provides enough information to fully populate the cache
>> atomically.
>> > However, when the entity is a symlink, we would need a second call to
>> `stat`
>> > to determine the properties of the linked file.
>> >
>> >
>> > Therefore, "atomic attribute access" is not possible to guarantee.
>> >
>> >
>> > An implementation may choose to partially populate the caches, omitting
>> > attributes about the linked entity; or it may choose to fully populate
>> the
>> > cache non-atomically using a non-atomic series of calls to `lstat` and
>> > `stat`.
>> >
>> >
>> > The former case obfuscates from the user which attributes are cached and
>> > which are not, opening them up to TOCTOU violations.
>> >
>> >
>> > Worse yet, the latter case, which fully populates the cache
>> non-atomically,
>> > potentially commits a TOCTOU violation itself, and this violation is
>> hidden
>> > from the user, preventing them from knowing it exists or being able to
>> do
>> > anything about it.
>> >
>> >
>> > Both solutions seem fraught with problems.
>> >
>> > Conclusion
>> >
>> > ===============
>> >
>> >
>> > I hope to have shown the pros and cons of different directory entry
>> caching
>> > implementations. Now I need your help deciding what's best for Libc++!
>> >
>> >
>> > Personally, I think the best solution is to revert these changes to the
>> > standard, since each possible implementation seems broken in one way or
>> > another. Directory entries should not cache, and directory iterators
>> should
>> > not attempt to populate the cache incompletely. Afterwards, a better and
>> > safer proposal can be put forward which to provides efficient and
>> atomic
>> > access to multiple attributes for the same entity can be proposed.
>> >
>> >
>> > Unfortunately, it might be to late at this point to revert the paper,
>> even
>> > if the committee generally agrees with these problems.
>> >
>> >
>> > Thoughts?
>> >
>> >
>> >
>> > [1]
>> http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2016/p0317r1.html
>> >
>> > [2] https://en.cppreference.com/w/cpp/experimental/fs/directory_entry
>> >
>> > [3] https://en.cppreference.com/w/cpp/filesystem/directory_entry
>> >
>> > [4] https://en.wikipedia.org/wiki/Time_of_check_to_time_of_use
>> >
>> >
>> >
>> >
>> >
>> > _______________________________________________
>> > cfe-dev mailing list
>> > cfe-dev at lists.llvm.org
>> > http://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-dev
>> >
>>
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.llvm.org/pipermail/cfe-dev/attachments/20180717/0b7495ef/attachment.html>
More information about the cfe-dev
mailing list