<div dir="ltr"><br><br><div class="gmail_quote"><div dir="ltr">On Tue, Jul 17, 2018 at 5:46 AM Aaron Ballman <<a href="mailto:aaron@aaronballman.com">aaron@aaronballman.com</a>> wrote:<br></div><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">Thank you for putting this together, I greatly appreciate it!<br>
<br>
Out of curiosity, is the caching behavior something you expect you'll<br>
make configurable for users of libc++ (for instance, a macro to enable<br>
or disable caching), or are there issues preventing such an<br>
implementation? My feeling is that we're going to have the usual push<br>
and pull between security and performance where different users will<br>
have different (and equally valid) use cases in mind justifying the<br>
need for the performance provided by caching or the security red flags<br>
enhanced by not caching. We should obviously pick a good default<br>
behavior (and can argue over the definition of "good" in that<br>
context), but I think no matter what option we pick, users will want<br>
the other option to be available.<br></blockquote><div><br></div><div>I hadn't thought of making it configurable, but I'm happy to do so.</div><div>It'll have to maintain the same ABI, so the cache will simply be unused,</div><div>but providing a configuration option seems feasible.</div><div> <br></div><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">
<br>
~Aaron<br>
<br>
On Mon, Jul 16, 2018 at 10:39 PM, Eric Fiselier via cfe-dev<br>
<<a href="mailto:cfe-dev@lists.llvm.org" target="_blank">cfe-dev@lists.llvm.org</a>> wrote:<br>
> Hi All,<br>
><br>
><br>
> I have a couple of questions and concerns regarding implementations of<br>
> P0317R1 [1], which I would like help from the community answering. The paper<br>
> changes `directory_entry` to cache a number of attributes, including file<br>
> status, permissions, file size, number of hard links, last write time, and<br>
> more. For reference, this is the interface of `directory_entry` before this<br>
> paper [2], and this is the interface after [3].<br>
><br>
><br>
> The implementation has a lot of latitude over which attributes it caches, if<br>
> any, and how it caches them. As I'll explain below, each different approach<br>
> can cause non-obvious behavior, possibly leading to TOCTOU bugs and security<br>
> vulnerabilities!<br>
><br>
><br>
> My question for the community is this:<br>
><br>
><br>
> Given the considerations below, what should libc++ choose to do?<br>
><br>
><br>
> 1. Cache nothing?<br>
><br>
> 2. Cache as much as possible, when possible?<br>
><br>
> 3. Something in between?<br>
><br>
><br>
> I would like to land std::filesystem before the 7.0 branch, and doing so<br>
> requires deciding on which implementation to provide now.<br>
><br>
><br>
> Note that this paper only considers POSIX based implementations of<br>
> filesystem.<br>
><br>
> TOCTOU Violations<br>
><br>
> ============================<br>
><br>
><br>
> Time of check to time of use, or TOCTOU, is a class of software bugs caused<br>
> by changes in a system between the time a condition is checked, and the time<br>
> the result is used [4]. If libc++ chooses to do any caching in<br>
> directory_entry, it will be vulnerable to this. If libc++ chooses to do no<br>
> caching at all, it will also be vulnerable to this but in different<br>
> scenarios.<br>
><br>
><br>
> Most importantly, whichever approach is taken, libc++ will bare some<br>
> responsibility for the TOCTOU bugs it allows users to write accidently. So<br>
> it’s imperative we consider all approaches carefully.<br>
><br>
><br>
> Let’s take a simple example:<br>
><br>
><br>
> void remove_symlinks_in_dir(path p) {<br>
><br>
> for (directory_entry ent : directory_iterator(p)) {<br>
><br>
> if (ent.is_symlink())<br>
><br>
> remove(ent);<br>
><br>
> }<br>
><br>
> }<br>
><br>
><br>
> Above, a TOCTOU violation may occur if another process acts on the file<br>
> referenced by `ent`, changing the attributes between the point the<br>
> directory_entry was created and the point at which the `is_symlink`<br>
> condition is checked.<br>
><br>
><br>
> At this point it's important to note that <filesystem> interface is, in<br>
> general, vulnerable to TOCTOU --- even without directory_entry caching. For<br>
> example:<br>
><br>
><br>
> void remove_symlinks_in_dir(path p) {<br>
><br>
> for (directory_entry ent : directory_iterator(p)) {<br>
><br>
> if (is_symlink(ent))<br>
><br>
> remove(ent);<br>
><br>
> }<br>
><br>
> }<br>
><br>
><br>
> In the above case, a TOCTOU violation still occurs, since changes to `ent`<br>
> can take place between the check and the call to `remove`. However, at least<br>
> the above example makes it clear *when* the check is occuring.<br>
><br>
><br>
> Simply eliminating caching will not prevent TOCTOU violations, but, in this<br>
> particular case, it would make them harder to write and make it more obvious<br>
> where they may occur. Having a cache both extends the period of time between<br>
> the check and the use, as well as hiding when the check actually occurs.<br>
><br>
><br>
> Perhaps the above concessions make it worthwhile to support a more efficient<br>
> interface with caching and fewer file system accesses?<br>
><br>
><br>
> If we do choose to cache some attributes, then, at minimum, it should be<br>
> obvious to the user which values are cached and which are not. The following<br>
> section will explore the feasibility of this.<br>
><br>
><br>
> Directory Iteration VS Refresh<br>
><br>
> ==================================<br>
><br>
><br>
> The paper's intention is to allow attributes returned during directory<br>
> iteration to be accessed by the user without additional calls to the<br>
> underlying filesystem.<br>
><br>
><br>
> The cache is populated two different ways:<br>
><br>
> During directory iteration, the cache is populated with only the attributes<br>
> returned by directory iteration function. With POSIX `readdir`, only the<br>
> file type as returned by `lstat` is available.<br>
><br>
> Using the `refresh` function, either called directly by the user, or<br>
> implicitly by the constructors or modifying methods (Ex. `assign`).<br>
> `refresh` has the ability to fully populate the cache (but not atomically in<br>
> some cases, see the symlink considerations below).<br>
><br>
><br>
> Note that const accessors may not update the cache.<br>
><br>
><br>
> Therefore the state of the cache depends on whether the `directory_entry`<br>
> was created during directory iteration or by the `directory_entry(path&)`<br>
> constructor, and whether there have been any intervening calls to `refresh`<br>
> or any other modifying member functions.<br>
><br>
><br>
> The following case demonstrates a surprising TOCTOU violation, only when the<br>
> `directory_entry` cache was populated during directory iteration, and not by<br>
> a call to refresh:<br>
><br>
><br>
> // Assume the filesystem supports permissions on symlinks.<br>
><br>
> bool is_readable_symlink(directory_entry ent) {<br>
><br>
> if (!ent.is_symlink())<br>
><br>
> return false;<br>
><br>
> // We should have the file_status for the symlink cached, right?<br>
><br>
> file_status st = ent.symlink_status();<br>
><br>
> // Nope! Only when `refresh` has been called are the perms cached.<br>
><br>
> assert(is_symlink(st)); // May fail!<br>
><br>
> return (st.permissions() & perms::others_read) != perms::none;<br>
><br>
> }<br>
><br>
><br>
> The above example is intended to demonstrate that the different accessors<br>
> may return a mix of cached and uncached values. Here `is_symlink()` is<br>
> cached, but `symlink_status()` may not be.<br>
><br>
> If the symlink has been replaced with a regular file, then the result of<br>
> this function is potentially dangerous nonsense.<br>
><br>
><br>
> So what happens when the user calls `refresh()` before accessing the<br>
> attributes? Does this make it safer?<br>
><br>
><br>
> The answer depends on the caching behavior of the implementation. and<br>
> whether that behavior provides "atomic" or "non-atomic" attribute access (as<br>
> defined in the next section).<br>
><br>
><br>
> Atomic VS Non-Atomic Attribute Access<br>
><br>
> =============================================<br>
><br>
><br>
> As previously described, TOCTOU violations require there to be intervening<br>
> time between the check and the use. We'll call any filesystem function, or<br>
> set of functions, which don’t allow for TOCTOU bugs "atomic". All other<br>
> functions or set of functions are considered "non-atomic".<br>
><br>
><br>
> Therefore, a `directory_entry` object can be said to provide "atomic<br>
> attribute access" across all accessors if and only if (A) all of the<br>
> attributes are cached, and (B) the cache was populated atomically. A<br>
> non-caching implementation of `directory_entry` never provides atomic<br>
> attribute access.<br>
><br>
><br>
> Let’s consider the consequences using the following example:<br>
><br>
><br>
> bool is_empty_regular_file(path p) {<br>
><br>
> directory_entry ent(p); // Calls refresh()<br>
><br>
> return ent.is_regular_file() && ent.file_size() == 0;<br>
><br>
> }<br>
><br>
><br>
> In a non-caching implementation `refresh()` is a nop, and therefore opens<br>
> the user up to TOCTOU issues occuring between the call to `is_symlink()` and<br>
> `is_regular_file()`.<br>
><br>
><br>
> In a fully-caching implementation, which provides atomic attribute access,<br>
> no such problem occurs.<br>
><br>
><br>
> In this case it seems preferable for libc++ to provide a fully-caching<br>
> implementation with "atomic attribute access". But is this possible? The<br>
> answer is "sometimes, but not always" as we'll see below.<br>
><br>
><br>
> Problems with Symlinks And Refresh()<br>
><br>
> ===========================================<br>
><br>
><br>
> Above we established that if refresh caches at all, then it should do so<br>
> "atomically", retrieving all attributes from the filesystem in a atomic<br>
> manner. Note that for symlinks, some of the cached attributes refer to the<br>
> symlink (is_symlink() and `symlink_status()), while the rest refer to the<br>
> linked entity (is_regular_file(), file_size(), etc).<br>
><br>
><br>
> When the directory entry does not refer to a symlink, a single call to<br>
> `lstat` provides enough information to fully populate the cache atomically.<br>
> However, when the entity is a symlink, we would need a second call to `stat`<br>
> to determine the properties of the linked file.<br>
><br>
><br>
> Therefore, "atomic attribute access" is not possible to guarantee.<br>
><br>
><br>
> An implementation may choose to partially populate the caches, omitting<br>
> attributes about the linked entity; or it may choose to fully populate the<br>
> cache non-atomically using a non-atomic series of calls to `lstat` and<br>
> `stat`.<br>
><br>
><br>
> The former case obfuscates from the user which attributes are cached and<br>
> which are not, opening them up to TOCTOU violations.<br>
><br>
><br>
> Worse yet, the latter case, which fully populates the cache non-atomically,<br>
> potentially commits a TOCTOU violation itself, and this violation is hidden<br>
> from the user, preventing them from knowing it exists or being able to do<br>
> anything about it.<br>
><br>
><br>
> Both solutions seem fraught with problems.<br>
><br>
> Conclusion<br>
><br>
> ===============<br>
><br>
><br>
> I hope to have shown the pros and cons of different directory entry caching<br>
> implementations. Now I need your help deciding what's best for Libc++!<br>
><br>
><br>
> Personally, I think the best solution is to revert these changes to the<br>
> standard, since each possible implementation seems broken in one way or<br>
> another. Directory entries should not cache, and directory iterators should<br>
> not attempt to populate the cache incompletely. Afterwards, a better and<br>
> safer proposal can be put forward which to provides efficient and atomic<br>
> access to multiple attributes for the same entity can be proposed.<br>
><br>
><br>
> Unfortunately, it might be to late at this point to revert the paper, even<br>
> if the committee generally agrees with these problems.<br>
><br>
><br>
> Thoughts?<br>
><br>
><br>
><br>
> [1] <a href="http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2016/p0317r1.html" rel="noreferrer" target="_blank">http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2016/p0317r1.html</a><br>
><br>
> [2] <a href="https://en.cppreference.com/w/cpp/experimental/fs/directory_entry" rel="noreferrer" target="_blank">https://en.cppreference.com/w/cpp/experimental/fs/directory_entry</a><br>
><br>
> [3] <a href="https://en.cppreference.com/w/cpp/filesystem/directory_entry" rel="noreferrer" target="_blank">https://en.cppreference.com/w/cpp/filesystem/directory_entry</a><br>
><br>
> [4] <a href="https://en.wikipedia.org/wiki/Time_of_check_to_time_of_use" rel="noreferrer" target="_blank">https://en.wikipedia.org/wiki/Time_of_check_to_time_of_use</a><br>
><br>
><br>
><br>
><br>
><br>
> _______________________________________________<br>
> cfe-dev mailing list<br>
> <a href="mailto:cfe-dev@lists.llvm.org" target="_blank">cfe-dev@lists.llvm.org</a><br>
> <a href="http://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-dev" rel="noreferrer" target="_blank">http://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-dev</a><br>
><br>
</blockquote></div></div>