[PATCH] Allow multiple modules with the same name to coexist in the module cache

Tue Apr 1 17:10:24 PDT 2014

On Tue, Apr 1, 2014 at 4:37 PM, Ben Langmuir <blangmuir at apple.com> wrote:

>
> On Apr 1, 2014, at 3:28 PM, Richard Smith <richard at metafoo.co.uk> wrote:
>
> On Tue, Apr 1, 2014 at 12:43 PM, Ben Langmuir <blangmuir at apple.com> wrote:
>
>>
>> On Mar 31, 2014, at 7:59 PM, Richard Smith <richard at metafoo.co.uk> wrote:
>>
>> Sorry, I think we're getting off-track here, my bad. I have a couple of
>> in-line comments then I'm going to try to pull this discussion together at
>> the bottom...
>>
>> On Mon, Mar 31, 2014 at 4:08 PM, Ben Langmuir <blangmuir at apple.com>
>> wrote:
>>
>>>
>>> On Mar 31, 2014, at 3:24 PM, Richard Smith <richard at metafoo.co.uk>
>>> wrote:
>>>
>>> On Mon, Mar 31, 2014 at 11:40 AM, Ben Langmuir <blangmuir at apple.com>
>>> wrote:
>>>
>>>>
>>>> On Mar 28, 2014, at 4:42 PM, Richard Smith <richard at metafoo.co.uk>
>>>> wrote:
>>>>
>>>> On Fri, Mar 28, 2014 at 3:27 PM, Ben Langmuir <blangmuir at apple.com>
>>>> wrote:
>>>>
>>>>> Hi Richard,
>>>>>
>>>>>
>>>>> On Mar 28, 2014, at 2:45 PM, Richard Smith <richard at metafoo.co.uk>
>>>>> wrote:
>>>>>
>>>>> On Fri, Mar 28, 2014 at 1:12 PM, Ben Langmuir <blangmuir at apple.com>
>>>>> wrote:
>>>>>
>>>>>> This patch allows multiple modules that have the same name to coexist
>>>>>> in the module cache.  To differentiate between two modules with the same
>>>>>> name, we will consider the path the module map file that they are defined
>>>>>> by* part of the 'key' for looking up the precompiled module (pcm file).
>>>>>>  Specifically, this patch renames the precompiled module (pcm) files from
>>>>>>
>>>>>> cache-path/<module hash>/Foo.pcm
>>>>>>
>>>>>> to
>>>>>>
>>>>>> cache-path/<module hash>/Foo-<hash of module map path>.pcm
>>>>>>
>>>>>
>>>>> From a high level, I don't really see why we need a second hash here.
>>>>> Shouldn't the -I options be included in the <module hash>? If I build the
>>>>> same module with different -I flags, that should resolve to different .pcm
>>>>> files, regardless of whether it makes the module name resolve to a
>>>>> different module.map file.
>>>>>
>>>>>
>>>>> If we include the -I options in the module hash, we will explode the
>>>>> number of module compilations needed.  The following should all be able to
>>>>> share a module 'A'.
>>>>>
>>>>
>>>> I really don't think they should, if that second -I path is used in any
>>>> way when building module 'A'.
>>>>
>>>>
>>>> clang -fmodules -I /path/to/A -I /path/to/B some_file.c
>>>>> clang -fmodules -I /path/to/A -I /path/to/C some_file2.c
>>>>> clang -fmodules -I /path/to/A -I /path/to/D some_file3.c
>>>>>
>>>>
>>>> I would think a better solution here would be to not have that second
>>>> -I path in header search when building module A (and thus not include it in
>>>> the hash for a principled reason).
>>>>
>>>> Ultimately, each module should have its own header search path -- the
>>>> model of one set of include paths for the entire TU (with that include path
>>>> potentially causing some libraries to find the wrong files) is a broken,
>>>> antiquated legacy of the non-modules world.
>>>>
>>>>
>>>> I agree with this in principle - having a predictable context for
>>>> building the modules would solve a lot of problems.  However, that seems to
>>>> only work for semantic import (@import, not #include), which is the less
>>>> common case for us at the moment.
>>>>
>>>
>>> Can you explain why? Suppose every module.map file lists the include
>>> paths for that module -- what problems does that create? (Is this problem
>>> specific to us generating implicit module.map files for frameworks?)
>>>
>>>
>>> If I write #include, or #import then I expect it to obey the include
>>> paths I set, since that is how it always worked before (for better or
>>> worse). For @import, and whatever other new syntax we introduce, I'm fully
>>> in favour of isolating the module build from the happenstance of the
>>> command line and using a header search derived from the module map file.
>>>
>>
>> I think it's reasonable that adding a module.map file affects the
>> semantics of a #include uses the module. In the case where someone has
>> written a module map that specifies an include search path, it seems
>> natural that that might (and should!) change the behavior of a #include
>> mapped to that module.
>>
>>
>> Conversely, I think it'd be surprising for a module build to pick up
>> headers from some directory in the user of the module's code, whether I
>> used @import or #include syntax to import the module. This is legacy
>> brokenness from the pre-modules era that I think we should not bring with
>> us.
>>
>>
>> In the short term, we should probably drop all header search paths that
>>>> are before the path in which the module map was found.
>>>>
>>>>
>>>> To be clear, this has the same correctness problem as my approach - the
>>>> header search paths that come before the module map may still change the
>>>> content of the module, especially if the module depends on other modules.
>>>>
>>>
>>> What I'm suggesting is:
>>>
>>> 1) Drop the -I paths that are earlier than the module in the header
>>> search path when building the module
>>> 2) Include the rest of the header search paths in the configuration hash
>>> for the module
>>>
>>> I don't see how this still leads to collisions (other than the madness
>>> that results from having multiple module.maps on your header search path
>>> that define the same module, where the order in which we happen to load
>>> them affects our semantics).
>>>
>>>
>>> It doesn't introduce collisions, but it may change the actual content of
>>> the module if the module includes headers that are affected by those
>>> include paths.
>>>
>>
>> I agree, but as noted above, I consider that to be a bug fix. No-one
>> writes libraries that expect the library user to provide headers for them
>> to include. If they did, I feel happy telling them that their library is
>> non-modular and needs to be fixed before it can be used with modules. Can
>> you give an example of a situation where you imagine this being an issue?
>>
>>
>> In general, adding the -I options is brittle when precompiled headers are
>>>> involved. Since you likely don't have exactly the same -I options when
>>>> building your PCH as when you use it (you likely have fewer -I paths when
>>>> building the PCH if it is being widely used) you could potentially have
>>>> different module hashes in the PCH and in the main file, which leads to
>>>> trying to load multiple copies of a module from different paths.  I ran
>>>> into this when I tried to naively add all of the -I options to the module
>>>> hash.  The heuristic you suggested for dropping the earlier paths might
>>>> help here, but it would be required for correctness, not merely as an
>>>> optimization.
>>>>
>>>
>>> I'll need to think about this one some more. My initial reaction is that
>>> if a PCH loads a module in one configuration and a user of the PCH tries to
>>> load the same module in a different configuration (because it's being built
>>> with different include paths), that should be a hard error. It seems to me
>>> that this is a separate issue from ensuring that we don't have collisions
>>> in the cache?
>>>
>>>
>>> This patch does make it an error if the user of the PCH does not provide
>>> a header search context in which  the modules resolve the same way as when
>>> the pch was built.  I don't think expecting the PCH to be created with
>>> exactly the same -I options is valuable, because it prevents legitimate
>>> reuse.
>>>
>>
>> OK. I think that's great as far as it goes, but it doesn't solve the
>> whole problem, because non-modular #includes can still resolve in different
>> ways.
>>
>> I am also not sure what effect this would have on the global module
>>>> index, since the set of modules to load would span multiple
>>>> hash-directories and we would need to avoid looking at incompatible
>>>> modules.  I'm not sure if this is a problem or not.
>>>>
>>>
>>> Yes, that's an interesting problem =)
>>>
>>>
>>>> Any thoughts?
>>>>
>>>
>>> I don't think I've got enough information to suggest something concrete
>>> yet, but this strikes me as a possible avenue to explore:
>>>
>>> 1) Teach module.map files to specify the include paths for the module
>>> 2) When implicitly generating a module.map file, inject the include
>>> paths from the host build (probably skipping ones prior to the directory
>>> containing the framework)
>>> 3) Include a hash of the contents of the implicit module.map file in the
>>> cache entry for any module that has an implicit module.map
>>> 4) Don't include the header search paths in the configuration hash,
>>> since they're not used when determining how to build a module any more.
>>>
>>>
>>> Does this help us with multiple modules having the same name?  I think
>>> we would still have the problem that when we go to load a pcm file it may
>>> have been built from a different module.map file than the one we found in
>>> this compilation (e.g. I have a module MyFoo, and so does another user on
>>> the system and we happen to share a module cache directory).  According to
>>> 4 we don't hash the -I paths, so these modules would (potentially) resolve
>>> to the same location on disk.
>>>
>>> This is convincing me more than ever that  regardless of what else we
>>> do, we need to store the path to the module map file with the pcm so that
>>> we can verify that the module we are trying to import resolves to the same
>>> location that it did when it was built. To me, the filename of the pcm is
>>> the most obvious place for this information, since that allows us to
>>> restrict ourselves to one module cache directory (ModuleCache/<module
>>> hash>) just like now, and it allows modules with the same name to live in
>>> the same cache without trampling each other or the possibility of
>>> accidentally loading the wrong one.
>>>
>>
>> I'd like to take a step back at this point and summarize what I believe
>> the state to be (sorry that this is rather long). We have three pieces of
>> data that between them uniquely identify the contents of a module:
>>
>>
>> Thanks for the summary :)
>>
>>
>> Build environment configuration:
>> --------------------------------
>>
>>   This covers things like the language options (and, currently,
>> configuration macros, though those probably belong in the module identity)
>>   We have one global module index per configuration, and we would like
>> each compilation action to only touch a single configuration.
>>   We want to reuse the same build environment configuration as much as
>> possible, to maximize cache hits.
>>   A hash of this data is used as the subdirectory in the module cache.
>>
>> Module identity:
>> ----------------
>>
>>   This covers things like the name of the module and the module.map file
>> in which it was defined.
>>   The proposed patch includes a hash of this data in the name of the .pcm
>> file.
>>
>> Module header:
>> --------------
>>
>>   This covers things that we discovered when building the module, such as
>> the module's dependencies and their timestamps.
>>   The proposed patch adds the list of imported modules to this, such that
>> header search must find the same module.
>>
>>   If we have a mismatch here, we rebuild the module.
>>
>> So far, this division seems entirely reasonable to me. Observations:
>>  * We only want to use one global index per build
>>  * We want to minimize the amount of irrelevant stuff in the global index
>> for efficiency reasons (otherwise we could just have one global index for
>> all configurations), and
>>  * We want to minimize the number of times we rebuild modules
>>
>> From the above, I think that we should use the following guidelines:
>>  * For correctness, the (build environment + module identity + module
>> header) must uniquely identify the module contents (because that's how we
>> determine whether we have a cache hit).
>>  * We want to pick the data that goes into the build environment such
>> that two module builds have the same build environment configuration if
>> they could conceivably be used in the same TU, and we should try to keep
>> their build environments different otherwise.
>>  * We only want to include things in the module header if changes to
>> those things mean we'll probably never want to use the old .pcm file again.
>>
>>
>> As long as the list of imported modules lives in the module header, I
>> don't think this last guideline makes sense.  For example, suppose I have a
>> library A that depends on library B.  I also have two users of library A,
>> and they each require a different version of B.  Or like the case above, I
>> am the developer of B and want to test against A using my modified version.
>>
>
> What I'm suggesting is that we should strive to put that data somewhere
> other than the module header, since putting it in the module header will
> cause unnecessary module rebuilds. If the set of module imports changes,
> the optimal course of action depends on *why* it changes: if it's due to a
> filesystem difference, we should rebuild the module (since the old pcm is
> probably not useful any more), but if it's due to an include path change,
> we should probably keep the existing .pcm file, since we might want to use
> it again.
>
> But that's only the optimal course of action, and we could certainly do
> something locally suboptimal here if it makes other factors better.
>
> The main point of contention seems to be how header search paths fit into
>> the above taxonomy. I think we've discussed a few options here:
>>
>> 1) They are a property of the module, and are independent of the header
>> search path of the module user. This means they are not part of the above
>> data.
>>
>> 2) They are a property of the build environment. (This was my original
>> position.) This seems to mean that the module identity can be just the
>> module name, if we enforce that the same module name can't be found
>> multiple times in the include path.
>>
>>
>> There is a hole here that Argyrios pointed out in an earlier email.  You
>> can add a new module map file to a path already in the search, and that may
>> change the module identity without changing its name or configuration.  Now
>> if we are enforcing that the module only be found once, you would
>> presumably have to remove/modify the module map it was originally found in.
>>  But to be 100% correct we would still need to include the module map file
>> in the module identity.
>>
>
> Yes, I agree, and I think you've fully convinced me that we should
> re-resolve of imported modules on module load (even if the include path is
> unchanged).
>
> In fact, you've convinced me that all the changes your patch makes are
> going in the right direction, so consider this an LGTM on the direction of
> the patch. Thanks for your perseverance! I'd still like to keep hashing out
> the details here, because I think we've not yet reached consensus on
> whether your patch is a complete solution or just a partial one.
>
>
> Great :) Do you also agree with using the module header to store and then
> re-resolve the imports or is that still contentious?
>

That seems like the right approach for dealing with the possibility of a
module map earlier in the include path introducing a module with the same
name (even if we also wanted some other mechanism for dealing with changes
to the include path).

> 3) They are a property of the module identity.
>>
>> 4) They're not part of anything. The names and resolved paths of imported
>> modules are included in the module header. We hope that the include path
>> didn't affect anything else. (This is what the patch we're discussing does.)
>>
>>
>> Option 4 seems to be only a partial solution to the include path issue,
>> because we still have cache collisions between modules with different
>> contents due to differing -I paths. Your suggested flag to make non-modular
>> #includes inside a module an error would fix this, but I don't think that
>> is going to work for much real-world code -- I don't think that non-modular
>> #includes are going away.
>>
>>
>> I agree this won't work for all code, but if the standard library is
>> modularized I think it does cover a lot of cases.
>>
>>
>> I think we can fix option 4 by also including the set of files #included
>> by non-modular includes in the module header (and anything else that
>> depends on the include paths), and re-resolving those files too when we
>> import a module. If the lookup differs, we rebuild the module. This has the
>> unfortunate effect that we might rebuild a module when the prior .pcm is
>> still useful, but that seems to be a rare case. I could probably live with
>> that. Does that sound OK to you?
>>
>>
>> When we discussed this internally, we felt that it would be expensive
>> (potentially *lots* of header search lookups), and also hairy to implement
>> (you need not only the #include spelling, but also the "includes" context -
>> maybe more?).
>>
>
> OK, but if we don't do it, then we can get bogus hits in the module cache.
> Perhaps those are rare enough that we don't need to worry, but they don't
> seem fundamentally different from the cache hits you're trying to avoid by
> re-resolving modules.
>
>
> Right, I'm not trying to solve this with my patch.  Maybe re-resolving the
> non-modular includes is the best we can do for modules that don't have
> their own search context and will not abide by the "it's modules all the
> way down" rule.
>

That sounds like it might be a good tradeoff. And as you say, it's not
something that this patch needs to deal with.

Another question is what include paths we should use for a module:
>>
>> 1) They inherit all the include paths from the host build. (We keep the
>> legacy broken model.)
>> 2) They inherit the include paths for themselves and anything later in
>> the header search path.
>> 3) They get to specify their own include path.
>>
>> (1) is the status quo. I suggest we allow (3) and fall back to (2) if no
>> include path is specified.
>>
>>
>> So what exactly does (3) entail? Does it provide the *entire* search
>> context, even for resolving dependent modules?  If so, we are making the
>> module map dependent on the system it is installed on.  If not, then we
>> still need to resolve module names during build to ensure they haven't
>> changed, regardless of the mechanism we use to find them.
>>
>
> I think a module should be able to isolate itself from the header search
> configuration of its users entirely, so I'd expect this to be the entire
> search context. Another thing we've been discussing internally is the
> ability for module map files to contain references to other module map
> files (so library X can say "I depend on module Y, whose module map is over
> there"), which largely removes the need for include paths.
>
>
> I'm also worried that (2) is so close to (1) that it will subtly break
>> users' expectations.
>>
>
> I think that's a fair concern.
>
>
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.llvm.org/pipermail/cfe-commits/attachments/20140401/d3adba9a/attachment.html>