[cfe-dev] [LLVMdev] LLVM & Clang file management

Tue Dec 6 09:28:31 PST 2011

On Tue, Dec 6, 2011 at 6:16 PM, Ruben Van Boxem
<vanboxem.ruben at gmail.com> wrote:
> 2011/12/6 Daniel Dunbar <daniel at zuster.org>
>>
>> On Tue, Dec 6, 2011 at 2:27 AM, Manuel Klimek <klimek at google.com> wrote:
>> > On Tue, Dec 6, 2011 at 2:11 AM, Michael Spencer <bigcheesegs at gmail.com>
>> > wrote:
>> >> On Sun, Dec 4, 2011 at 9:06 AM, Manuel Klimek <klimek at google.com>
>> >> wrote:
>> >>> On Sat, Dec 3, 2011 at 10:33 PM, Douglas Gregor <dgregor at apple.com>
>> >>> wrote:
>> >>>> Hi Manuel,
>> >>>>
>> >>>> On Nov 28, 2011, at 2:49 AM, Manuel Klimek wrote:
>> >>>>
>> >>>>> Hi,
>> >>>>>
>> >>>>> while working on tooling on top of clang/llvm we found the file
>> >>>>> system
>> >>>>> abstractions in clang/llvm to be one of the points that could be
>> >>>>> nicer
>> >>>>> to integrate with. I’m writing this mail to propose a strawman and
>> >>>>> get
>> >>>>> some feedback on what you guys think the right way forward is (or
>> >>>>> whether we should just leave things as they are).
>> >>>>>
>> >>>>> First, the FileManager we have in clang has helped us a lot for our
>> >>>>> tooling - when we run clang in a mapreduce we don’t need to lay out
>> >>>>> files on a disk, we can just map files into memory and happily clang
>> >>>>> over them. We’re also using the same mechanism to map builtin
>> >>>>> includes; in short, the FileManager has made it possible to do clang
>> >>>>> at scale.
>> >>>>>
>> >>>>> Now we’re aware that it was not really the intention of the
>> >>>>> FileManager to allow doing the things we do with it: not every
>> >>>>> module
>> >>>>> in clang uses the FileManager, and the moment we hit llvm there is
>> >>>>> no
>> >>>>> FileManager at all. For example, in case of the Driver we hack
>> >>>>> around
>> >>>>> the fact that the header search tries to access the file system
>> >>>>> directly in rather brittle ways, relying on implementation details
>> >>>>> and
>> >>>>> #ifdefs.
>> >>>>>
>> >>>>> So why not make FileManager a more principled (and still blazing
>> >>>>> fast)
>> >>>>> file system abstraction?
>> >>>>
>> >>>> Yes, please!
>> >>>
>> >>> Great :) /me jumps right into the design discussion then.
>> >>>
>> >>>> Having a proper virtual file system across Clang and LLVM would be a
>> >>>> huge boon, especially for pushing Clang into more applications that aren't
>> >>>> simply "grab stuff from the local file system." The current
>> >>>> FileManager/SourceManager dance used to provide in-memory content for a
>> >>>> (virtual or old) file is quite the mess.
>> >>>>
>> >>>>> Pro:
>> >>>>> - only one interface for developers to learn on the project (no more
>> >>>>> PathV1 vs PathV2 vs FileManager)
>> >>>>> - only one implementation (per-platform) for easier maintenance of
>> >>>>> the
>> >>>>> file system platform abstraction
>> >>>>> - one point to insert synchronization guarantees for tools / IDE
>> >>>>> integration that wants to run clang in multiple threads at once (for
>> >>>>> example when re-indexing on 12-ht-core machines)
>> >>>>> - being able to replay compilations by injecting a virtual file
>> >>>>> system
>> >>>>> that exactly “copies” the original file system’s content, which
>> >>>>> allows
>> >>>>> easy scaling of replays, running tools against dirty edit buffers on
>> >>>>> a
>> >>>>> lower level than the SourceManager and unit testing
>> >>>>
>> >>>> … and making sure that all of the various stages of compilation see
>> >>>> the same view of the file system.
>> >>>>
>> >>>>> Con:
>> >>>>> - there would be yet another try at unifying the APIs which would be
>> >>>>> in an intermediate state while being worked on (and PathV1 vs PathV2
>> >>>>> is already bad enough)
>> >>>>
>> >>>> I'm fine with intermediate states so long as the direction and
>> >>>> benefits are clear. The former we can certainly discuss, and the latter is
>> >>>> obvious already.
>> >>>>
>> >>>>> - making it the canonical file system interface is a lot of effort
>> >>>>> that requires touching a lot of systems (while we’re volunteering to
>> >>>>> do the work, it will probably eat up other people’s time, too)
>> >>>>
>> >>>> I doubt I'll have much time to directly hack on this, but I'll be
>> >>>> happy to review / discuss / help with adoption. libclang is one of the huge
>> >>>> beneficiaries of such a change, so I care a lot about getting that to work
>> >>>> well.
>> >>>>
>> >>>>> What parts (if any) of this type of transition make sense?
>> >>>>> 1. Figure out the “correct” interface we’d want for FileManager to
>> >>>>> be
>> >>>>> more generally useful
>> >>>>> 2. Change FileManager to that interface
>> >>>>> 3. Sink FileManager into llvm, so it can be used by other projects
>> >>>>> 4. Use it throughout clang
>> >>>>> 5. Use it throughout llvm
>> >>>>> We don’t need to do all of them at once, and should be able to
>> >>>>> evaluate the results along the way.
>> >>>>
>> >>>> I share some of Daniel's concern about re-using FileManager, because
>> >>>> the interface is very narrowly designed for Clang's usage and some of the
>> >>>> functionality intended for the VFS is split out into SourceManager. My
>> >>>> advice would be to start building a new VFS down in LLVM, and make
>> >>>> FileManager an increasingly-shrinking interface on top of the new VFS. At
>> >>>> some point, FileManager will be thin enough that its clients can just switch
>> >>>> directly over to using the VFS, and FileManager can eventually go away.
>> >>>>
>> >>>> I do realize that this could end up like PathV1 vs. PathV2, where
>> >>>> both exist for a while, but the benefits of the VFS should outweigh our
>> >>>> collective laziness.
>> >>>
>> >>> So, as I noted in my reply to Daniel, after working through
>> >>> llvm/Support (and bringing FileManager back to my mind) I think I'm
>> >>> actually seeing a way forward, tell me if I'm crazy:
>> >>> 1. morph FileSystem (I don't know whether that would include PathV2,
>> >>> but I currently don't think so) into a class that exports a nice
>> >>> interface for all FileSystem functions that we can override; to be
>> >>> able to do that step-by-step, we could for example introduce a static
>> >>> FileSystem pointer that is initialized with the default system file
>> >>> system on startup (I like being able to do baby-steps)
>> >>> 2. add methods to FileSystem to support opening MemoryBuffers; the
>> >>> path forward will be to move all calls to MemoryBuffer::get*File
>> >>> through the FileSystem interface, but again that can be handled
>> >>> incrementally
>> >>> 3. at that point we'd have enough stuff in FileSystem to rebase
>> >>> FileManager on top of it; once 1 and 2 are finished for clang/.* we'll
>> >>> be able to completely move the virtual file support over into a nice
>> >>> OverlayFileSystem implementation (argh, I've coded too many of those
>> >>> in my life);
>> >>> 4. add methods to FileSystem to support opening raw_fd_ostreams; this
>> >>> is basically the process for reading mirrored
>> >>>
>> >>> Thoughts? Completely broken approach? Broken order?
>> >>>
>> >>> On a different note, switching to the SourceManager topic - I know
>> >>> enough about SourceManager to be dangerous but not enough to ever
>> >>> claim I would have understood the crazy buffer management that's going
>> >>> on in ContentCache :) So I'd need a lot of help to pry that box open
>> >>> eventually. Currently I'd think that this can be done in a subsequent
>> >>> step after the file system is sorted out, but I might be wrong...
>> >>>
>> >>> Cheers,
>> >>> /Manuel
>> >>
>> >> Just for some background about why we have PathV2.
>> >>
>> >> In my quest to improve Windows support across LLVM and Clang I ran
>> >> into many issues with the way PathV1 worked. A few were:
>> >> * PathV1, and most of LLVM, use std::string to handle errors. This
>> >> makes code more verbose than needed, and loses OS-level error
>> >> information.
>> >> * PathV1 makes it difficult to handle Unicode on Windows. Although
>> >> apparently I didn't solve the problem correctly either :P.
>> >
>> > Are there open bugs? A quick search for unicode on llvm.org/bugs
>> > didn't show anything windows specific.
>> >
>> >> * PathV1 requires constructing a Path object before calling any
>> >> functions. This is inefficient when most of the time you have
>> >> something StringRef'able.
>> >>
>> >> Thus when I designed PathV2 I made it stateless, UTF-8 only, and used
>> >> error_code.
>> >>
>> >> The reason I bring this up is because I support a VFS, however, I want
>> >> to make sure that we keep in mind the reasons PathV2 was created while
>> >> writing it.
>> >
>> > Yep, that's an important point. As I said, I've looked into PathV2 and
>> > I really like the distinction between path manipulation and file
>> > system access, and the general design of both PathV2 and
>> > Support/FileSystem.
>> >
>> >> PathV1 -> PathV2 transition stopped because I ran out of time to do
>> >> it. There's so much code that uses it, and some of the changes are
>> >> non-trivial in the cases where the Path class is stored and accessed in
>> >> many places instead of just used to access the path functions.
>> >>
>> >> The approach and order seem good to me. The llvm::sys::path parts can
>> >> stay separate, only the llvm::sys::fs parts need to be virtualized.
>> >
>> > Yep, that was exactly my thought. Thanks for confirming and providing
>> > all the background information! :)
>>
>> Not sure if I follow here, but we will need to do some amount of work
>> on sys::path (not virtualization per se).
>>
>> The current PathV2 API has embedded into it an assumption of working
>> with the native path type.
>>
>> In the system I was originally imagining, we would have something like:
>>  (1) sys::path::unix and sys::path::windows. So client code can use a
>> windows specific version if it wanted to for some reason. These would
>> have the same functions in them.
>
>
> Worst. Idea. Ever. Sorry to be blunt, but how does that help higher-level
> code at all? The underlying implementation (type) of the path objects could
> be different, but the API itself should really be platform-independent. No
> use for a Windows path in a Unix app, and if so, it's not LLVM's place to
> provide that unneeded functionality.
>
>>  (2) sys::path, which would just be the same as one of the two
>> previous namespaces, selected to match the host.
>
>
> This is better. LLVM/Clang shouldn't know what a sys::path is (UTF8 Unix
> path or UTF16 Windows path), they just need a "path".
>
>>
>>  (3) some other path variants, which would take a FileSystem object,
>> and then call the appropriate path functions for the FileSystem type.
>>
>> Eventually, any code which we want to be virtualizable would need to
>> move to not using the sys::path functions that don't take a FileSystem
>> object.
>
>
> You guys probably know more than I do about LLVM/Clang's needs wrt a file
> system cache, but apart from that, why not model according to the Boost
> implementation? Futureproof and *proven useful*.

As far as I understand that's the case currently with Support/PathV2
and Support/FileSystem.
The point of the exercise is not to radically change the interface
(the interface looks fine and makes sense),
but to virtualize it.
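
To make "virtualize" concrete, here is a minimal, self-contained sketch in
plain C++ (standard library only, no LLVM headers). All names in it
(FileSystem, RealFileSystem, OverlayFileSystem, currentFileSystem, readFile,
exists, addFile) are hypothetical and made up for illustration; none of them
is an existing LLVM or Clang API. It assumes an abstract interface that
reports failures through std::error_code, a process-wide pointer that
defaults to the real file system, and an overlay that serves in-memory files
before falling back to the file system underneath it.

```cpp
#include <cassert>
#include <fstream>
#include <map>
#include <sstream>
#include <string>
#include <system_error>
#include <utility>

// Hypothetical interface: the overridable part of the file system API.
class FileSystem {
public:
  virtual ~FileSystem() = default;
  // Reads the whole file at Path into Out; a default-constructed
  // error_code means success.
  virtual std::error_code readFile(const std::string &Path,
                                   std::string &Out) = 0;
  virtual bool exists(const std::string &Path) = 0;
};

// Default implementation that goes to the host file system.
class RealFileSystem : public FileSystem {
public:
  std::error_code readFile(const std::string &Path,
                           std::string &Out) override {
    std::ifstream In(Path, std::ios::binary);
    if (!In)
      return std::make_error_code(std::errc::no_such_file_or_directory);
    std::ostringstream Buf;
    Buf << In.rdbuf();
    Out = Buf.str();
    return std::error_code();
  }
  bool exists(const std::string &Path) override {
    return std::ifstream(Path).good();
  }
};

// Serves in-memory files first and falls back to the wrapped file system.
class OverlayFileSystem : public FileSystem {
  FileSystem &Base;
  std::map<std::string, std::string> Files;

public:
  explicit OverlayFileSystem(FileSystem &Base) : Base(Base) {}

  void addFile(std::string Path, std::string Contents) {
    Files[std::move(Path)] = std::move(Contents);
  }
  std::error_code readFile(const std::string &Path,
                           std::string &Out) override {
    auto It = Files.find(Path);
    if (It == Files.end())
      return Base.readFile(Path, Out);
    Out = It->second;
    return std::error_code();
  }
  bool exists(const std::string &Path) override {
    return Files.count(Path) != 0 || Base.exists(Path);
  }
};

// Process-wide pointer, initialized to the real file system so that
// existing callers keep working while code is migrated step by step.
FileSystem *&currentFileSystem() {
  static RealFileSystem Real;
  static FileSystem *Current = &Real;
  return Current;
}
```

A tool would construct an OverlayFileSystem on top of *currentFileSystem(),
add its in-memory headers with addFile, and assign the overlay's address to
currentFileSystem(); code that only talks to the interface then sees the
virtual files without knowing about them.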

Cheers,
/Manuel