[LLVMdev] [cfe-dev] LLVM & Clang file management

Tue Dec 6 09:16:54 PST 2011

2011/12/6 Daniel Dunbar <daniel at zuster.org>

> On Tue, Dec 6, 2011 at 2:27 AM, Manuel Klimek <klimek at google.com> wrote:
> > On Tue, Dec 6, 2011 at 2:11 AM, Michael Spencer <bigcheesegs at gmail.com>
> wrote:
> >> On Sun, Dec 4, 2011 at 9:06 AM, Manuel Klimek <klimek at google.com>
> wrote:
> >>> On Sat, Dec 3, 2011 at 10:33 PM, Douglas Gregor <dgregor at apple.com>
> wrote:
> >>>> Hi Manuel,
> >>>>
> >>>> On Nov 28, 2011, at 2:49 AM, Manuel Klimek wrote:
> >>>>
> >>>>> Hi,
> >>>>>
> >>>>> while working on tooling on top of clang/llvm we found the file
> system
> >>>>> abstractions in clang/llvm to be one of the points that could be
> nicer
> >>>>> to integrate with. I’m writing this mail to propose a strawman and
> get
> >>>>> some feedback on what you guys think the right way forward is (or
> >>>>> whether we should just leave things as they are).
> >>>>>
> >>>>> First, the FileManager we have in clang has helped us a lot for our
> >>>>> tooling - when we run clang in a mapreduce we don’t need to lay out
> >>>>> files on a disk, we can just map files into memory and happily clang
> >>>>> over them. We’re also using the same mechanism to map builtin
> >>>>> includes; in short, the FileManager has made it possible to do clang
> >>>>> at scale.
> >>>>>
> >>>>> Now we’re aware that it was not really the intention of the
> >>>>> FileManager to allow doing the things we do with it: not every module
> >>>>> in clang uses the FileManager, and the moment we hit llvm there is no
> >>>>> FileManager at all. For example, in case of the Driver we hack around
> >>>>> the fact that the header search tries to access the file system
> >>>>> directly in rather brittle ways, relying on implementation details
> and
> >>>>> #ifdefs.
> >>>>>
> >>>>> So why not make FileManager a more principled (and still blazing
> fast)
> >>>>> file system abstraction?
> >>>>
> >>>> Yes, please!
> >>>
> >>> Great :) /me jumps right into the design discussion then.
> >>>
> >>>> Having a proper virtual file system across Clang and LLVM would be a
> huge boon, especially for pushing Clang into more applications that aren't
> simply "grab stuff from the local file system." The current
> FileManager/SourceManager dance used to provide in-memory content for a
> (virtual or old) file is quite the mess.
> >>>>
> >>>>> Pro:
> >>>>> - only one interface for developers to learn on the project (no more
> >>>>> PathV1 vs PathV2 vs FileManager)
> >>>>> - only one implementation (per-platform) for easier maintenance of
> the
> >>>>> file system platform abstraction
> >>>>> - one point to insert synchronization guarantees for tools / IDE
> >>>>> integration that wants to run clang in multiple threads at once (for
> >>>>> example when re-indexing on 12-ht-core machines)
> >>>>> - being able to replay compilations by injecting a virtual file
> system
> >>>>> that exactly “copies” the original file system’s content, which
> allows
> >>>>> easy scaling of replays, running tools against dirty edit buffers on
> a
> >>>>> lower level than the SourceManager and unit testing
> >>>>
> >>>> … and making sure that all of the various stages of compilation see
> the same view of the file system.
> >>>>
> >>>>> Con:
> >>>>> - there would be yet another try at unifying the APIs which would be
> >>>>> in an intermediate state while being worked on (and PathV1 vs PathV2
> >>>>> is already bad enough)
> >>>>
> >>>> I'm fine with intermediate states so long as the direction and
> benefits are clear. The former we can certainly discuss, and the latter is
> obvious already.
> >>>>
> >>>>> - making it the canonical file system interface is a lot of effort
> >>>>> that requires touching a lot of systems (while we’re volunteering to
> >>>>> do the work, it will probably eat up other people’s time, too)
> >>>>
> >>>> I doubt I'll have much time to directly hack on this, but I'll be
> happy to review / discuss / help with adoption. libclang is one of the huge
> beneficiaries of such a change, so I care a lot about getting that to work
> well.
> >>>>
> >>>>> What parts (if any) of this type of transition make sense?
> >>>>> 1. Figure out the “correct” interface we’d want for FileManager to be
> >>>>> more generally useful
> >>>>> 2. Change FileManager to that interface
> >>>>> 3. Sink FileManager into llvm, so it can be used by other projects
> >>>>> 4. Use it throughout clang
> >>>>> 5. Use it throughout llvm
> >>>>> We don’t need to do all of them at once, and should be able to
> >>>>> evaluate the results along the way.
> >>>>
> >>>> I share some of Daniel's concern about re-using FileManager, because
> the interface is very narrowly designed for Clang's usage and some of the
> functionality intended for the VFS is split out into SourceManager. My
> advice would be to start building a new VFS down in LLVM, and make
> FileManager an increasingly-shrinking interface on top of the new VFS. At
> some point, FileManager will be thin enough that its clients can just
> switch directly over to using the VFS, and FileManager can eventually go
> away.
> >>>>
> >>>> I do realize that this could end up like PathV1 vs. PathV2, where
> both exist for a while, but the benefits of the VFS should outweigh our
> collective laziness.
> >>>
> >>> So, as I noted in my reply to Daniel, after working through
> >>> llvm/Support (and bringing FileManager back to my mind) I think I'm
> >>> actually seeing a way forward, tell me if I'm crazy:
> >>> 1. morph FileSystem (I don't know whether that would include PathV2,
> >>> but I currently don't think so) into a class that exports a nice
> >>> interface for all FileSystem functions that we can override; to be
> >>> able to do that step-by-step, we could for example introduce a static
> >>> FileSystem pointer that is initialized with the default system file
> >>> system on startup (I like being able to do baby-steps)
> >>> 2. add methods to FileSystem to support opening MemoryBuffers; the
> >>> path forward will be to move all calls to MemoryBuffer::get*File
> >>> through the FileSystem interface, but again that can be handled
> >>> incrementally
> >>> 3. at that point we'd have enough stuff in FileSystem to rebase
> >>> FileManager on top of it; once 1 and 2 are finished for clang/.* we'll
> >>> be able to completely move the virtual file support over into a nice
> >>> OverlayFileSystem implementation (argh, I've coded too many of those
> >>> in my life);
> >>> 4. add methods to FileSystem to support opening raw_fd_ostreams; this
> >>> is basically the process for reading mirrored
> >>>
> >>> Thoughts? Completely broken approach? Broken order?
> >>>
> >>> On a different note, switching to the SourceManager topic - I know
> >>> enough about SourceManager to be dangerous but not enough to ever
> >>> claim I would have understood the crazy buffer management that's going
> >>> on in ContentCache :) So I'd need a lot of help to pry that box open
> >>> eventually. Currently I'd think that this can be done in a subsequent
> >>> step after the file system is sorted out, but I might be wrong...
> >>>
> >>> Cheers,
> >>> /Manuel
> >>
> >> Just for some background about why we have PathV2.
> >>
> >> In my quest to improve Windows support across LLVM and Clang I ran
> >> into many issues with the way PathV1 worked. A few were:
> >> * PathV1, and most of LLVM, use std::string to handle errors. This
> >> makes code more verbose than needed, and loses os level error
> >> information.
> >> * PathV1 makes it difficult to handle Unicode on Windows. Although
> >> apparently I didn't solve the problem correctly either :P.
> >
> > Are there open bugs? A quick search for unicode on llvm.org/bugs
> > didn't show anything windows specific.
> >
> >> * PathV1 requires constructing a Path object before calling any
> >> functions. This is inefficient when most of the time you have
> >> something StringRef'able.
> >>
> >> Thus when I designed PathV2 I made it stateless, utf-8 only, and used
> >> error_code.
> >>
> >> The reason I bring this up is because I support a VFS, however, I want
> >> to make sure that we keep in mind the reasons PathV2 was created while
> >> writing it.
> >
> > Yep, that's an important point. As I said, I've looked into PathV2 and
> > I really like the distinction between path manipulation and file
> > system access, and the general design of both PathV2 and
> > Support/FileSystem.
> >
> >> PathV1 -> PathV2 transition stopped because I ran out of time to do
> >> it. There's so much code that uses it, and some of the changes are non
> >> trivial in the cases where the Path class is stored and accessed many
> >> places instead of just used to access the path functions.
> >>
> >> The approach and order seems good to me. The llvm::sys::path parts can
> >> stay separate, only the llvm::sys::fs parts need to be virtualized.
> >
> > Yep, that was exactly my thought. Thanks for confirming and providing
> > all the background information! :)
>
> Not sure if I follow here, but we will need to do some amount of work
> on sys::path (not virtualization per se).
>
> The current PathV2 API has embedded into it an assumption of working
> with the native path type.
>
> In the system I was originally imagining, we would have something like:
>  (1) sys::path::unix and sys::path::windows. So client code can use a
> windows specific version if it wanted to for some reason. These would
> have the same functions in them.
>

Worst. Idea. Ever. Sorry to be blunt, but how does that help higher-level
code at all? The underlying implementation (type) of the path objects could
be different, but the API itself should really be platform-independent. There
is no use for a Windows path in a Unix app, and even if there were, it's not
LLVM's place to provide that unneeded functionality.

>  (2) sys::path, which would just be the same as one of the two
> previous namespaces, selected to match the host.
>

This is better. LLVM/Clang shouldn't know what a sys::path is (UTF8 Unix
path or UTF16 Windows path), they just need a "path".
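
To make sure we are talking about the same thing, here is a minimal sketch of
(1) and (2) as I read them. Every name in it (sketch_host, path_unix,
path_windows, is_absolute) is made up for illustration and is not the existing
llvm::sys::path API. The drive-letter check is an assumption as well and
ignores UNC paths.

```cpp
#include <cassert>
#include <cctype>
#include <string>

namespace sketch_host {

// (1) Two namespaces exposing the same functions, one per path flavour.
namespace path_unix {
inline bool is_absolute(const std::string &P) {
  return !P.empty() && P[0] == '/';
}
} // namespace path_unix

namespace path_windows {
inline bool is_absolute(const std::string &P) {
  // Only the drive-letter form ("C:\foo" or "C:/foo") is recognised here.
  return P.size() >= 3 &&
         std::isalpha(static_cast<unsigned char>(P[0])) && P[1] == ':' &&
         (P[2] == '\\' || P[2] == '/');
}
} // namespace path_windows

// (2) A plain "path" namespace that aliases whichever flavour matches the
// host, so ordinary client code never has to name a platform.
#ifdef _WIN32
namespace path = path_windows;
#else
namespace path = path_unix;
#endif

} // namespace sketch_host
```

Client code would then only ever call sketch_host::path::is_absolute(P), which
is the part I agree with; the per-platform namespaces stay an implementation
detail unless somebody asks for them by name.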

>  (3) some other path variants, which would take a FileSystem object,
> and then call the appropriate path functions for the FileSystem type.
>
> Eventually, any code which we want to be virtualizable would need to
> move to not using the sys::path functions that don't take a FileSystem
> object.
>
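
For (3), this is roughly what I picture when a path function takes a
FileSystem object: a rough sketch only, under the assumption that the file
system reports which path flavour it uses. FileSystem, InMemoryFileSystem,
PathStyle and this is_absolute overload are hypothetical names, not anything
that exists in LLVM or Clang today.

```cpp
#include <cassert>
#include <map>
#include <string>

namespace sketch_vfs {

enum class PathStyle { Unix, Windows };

// Hypothetical virtual file system interface; real code would need far more
// than an existence check (open, stat, directory iteration, ...).
class FileSystem {
public:
  virtual ~FileSystem() = default;
  virtual PathStyle pathStyle() const = 0;
  virtual bool exists(const std::string &Path) const = 0;
};

// An in-memory implementation, e.g. for tools that map files into memory
// instead of laying them out on disk.
class InMemoryFileSystem : public FileSystem {
  PathStyle Style;
  std::map<std::string, std::string> Files; // path -> contents

public:
  explicit InMemoryFileSystem(PathStyle S) : Style(S) {}
  void addFile(const std::string &Path, const std::string &Contents) {
    Files[Path] = Contents;
  }
  PathStyle pathStyle() const override { return Style; }
  bool exists(const std::string &Path) const override {
    return Files.count(Path) != 0;
  }
};

// A path function that asks the FileSystem which rules apply instead of
// assuming the host's native path type.
inline bool is_absolute(const FileSystem &FS, const std::string &P) {
  if (FS.pathStyle() == PathStyle::Unix)
    return !P.empty() && P[0] == '/';
  return P.size() >= 3 && P[1] == ':' && (P[2] == '\\' || P[2] == '/');
}

} // namespace sketch_vfs
```

The point of the sketch is that the caller's code is identical for both
flavours; only the FileSystem object it was handed differs.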

You guys probably know more than I do about LLVM/Clang's needs wrt a file
system cache, but apart from that, why not model according to the Boost
implementation? Futureproof and *proven useful*.

Ruben