[PATCH] Allow multiple modules with the same name to coexist in the module cache

Tue Apr 1 16:37:18 PDT 2014

On Apr 1, 2014, at 3:28 PM, Richard Smith <richard at metafoo.co.uk> wrote:

> On Tue, Apr 1, 2014 at 12:43 PM, Ben Langmuir <blangmuir at apple.com> wrote:
> 
> On Mar 31, 2014, at 7:59 PM, Richard Smith <richard at metafoo.co.uk> wrote:
> 
>> Sorry, I think we're getting off-track here, my bad. I have a couple of in-line comments then I'm going to try to pull this discussion together at the bottom...
>> 
>> On Mon, Mar 31, 2014 at 4:08 PM, Ben Langmuir <blangmuir at apple.com> wrote:
>> 
>> On Mar 31, 2014, at 3:24 PM, Richard Smith <richard at metafoo.co.uk> wrote:
>> 
>>> On Mon, Mar 31, 2014 at 11:40 AM, Ben Langmuir <blangmuir at apple.com> wrote:
>>> 
>>> On Mar 28, 2014, at 4:42 PM, Richard Smith <richard at metafoo.co.uk> wrote:
>>> 
>>>> On Fri, Mar 28, 2014 at 3:27 PM, Ben Langmuir <blangmuir at apple.com> wrote:
>>>> Hi Richard,
>>>> 
>>>> 
>>>> On Mar 28, 2014, at 2:45 PM, Richard Smith <richard at metafoo.co.uk> wrote:
>>>> 
>>>>> On Fri, Mar 28, 2014 at 1:12 PM, Ben Langmuir <blangmuir at apple.com> wrote:
>>>>> This patch allows multiple modules that have the same name to coexist in the module cache.  To differentiate between two modules with the same name, we will consider the path the module map file that they are defined by* part of the ‘key’ for looking up the precompiled module (pcm file).  Specifically, this patch renames the precompiled module (pcm) files from
>>>>> 
>>>>> cache-path/<module hash>/Foo.pcm
>>>>> 
>>>>> to
>>>>> 
>>>>> cache-path/<module hash>/Foo-<hash of module map path>.pcm
>>>>> 
>>>>> From a high level, I don't really see why we need a second hash here. Shouldn't the -I options be included in the <module hash>? If I build the same module with different -I flags, that should resolve to different .pcm files, regardless of whether it makes the module name resolve to a different module.map file.
>>>> 
>>>> If we include the -I options in the module hash, we will explode the number of module compilations needed.  The following should all be able to share a module ‘A’.
>>>> 
>>>> I really don't think they should, if that second -I path is used in any way when building module 'A’.
>>>> 
>>>> clang -fmodules -I /path/to/A -I /path/to/B some_file.c
>>>> clang -fmodules -I /path/to/A -I /path/to/C some_file2.c
>>>> clang -fmodules -I /path/to/A -I /path/to/D some_file3.c
>>>> 
>>>> I would think a better solution here would be to not have that second -I path in header search when building module A (and thus not include it in the hash for a principled reason).
>>>> 
>>>> Ultimately, each module should have its own header search path -- the model of one set of include paths for the entire TU (with that include path potentially causing some libraries to find the wrong files) is a broken, antiquated legacy of the non-modules world. 
>>> 
>>> I agree with this in principle - having a predictable context for building the modules would solve a lot of problems.  However, that seems to only work for semantic import (@import, not #include), which is the less common case for us at the moment.  
>>> 
>>> Can you explain why? Suppose every module.map file lists the include paths for that module -- what problems does that create? (Is this problem specific to us generating implicit module.map files for frameworks?)
>> 
>> If I write #include, or #import then I expect it to obey the include paths I set, since that is how it always worked before (for better or worse). For @import, and whatever other new syntax we introduce, I’m fully in favour of isolating the module build from the happenstance of the command line and using a header search derived from the module map file.
>> 
>> I think it's reasonable that adding a module.map file affects the semantics of a #include uses the module. In the case where someone has written a module map that specifies an include search path, it seems natural that that might (and should!) change the behavior of a #include mapped to that module.  
>> 
>> Conversely, I think it'd be surprising for a module build to pick up headers from some directory in the user of the module's code, whether I used @import or #include syntax to import the module. This is legacy brokenness from the pre-modules era that I think we should not bring with us.
>> 
>>>> In the short term, we should probably drop all header search paths that are before the path in which the module map was found.
>>> 
>>> To be clear, this has the same correctness problem as my approach - the header search paths that come before the module map may still change the content of the module, especially if the module depends on other modules.
>>> 
>>> What I'm suggesting is:
>>> 
>>> 1) Drop the -I paths that are earlier than the module in the header search path when building the module
>>> 2) Include the rest of the header search paths in the configuration hash for the module
>>> 
>>> I don't see how this still leads to collisions (other than the madness that results from having multiple module.maps on your header search path that define the same module, where the order in which we happen to load them affects our semantics).
>> 
>> It doesn’t introduce collisions, but it may change the actual content of the module if the module includes headers that are affected by those include paths.
>> 
>> I agree, but as noted above, I consider that to be a bug fix. No-one writes libraries that expect the library user to provide headers for them to include. If they did, I feel happy telling them that their library is non-modular and needs to be fixed before it can be used with modules. Can you give an example of a situation where you imagine this being an issue?
>> 
>>> In general, adding the -I options is brittle when precompiled headers are involved. Since you likely don’t have exactly the same -I options when building your PCH as when you use it (you likely have fewer -I paths when building the PCH if it is being widely used) you could potentially have different module hashes in the PCH and in the main file, which leads to trying to load multiple copies of a module from different paths.  I ran into this when I tried to naively add all of the -I options to the module hash.  The heuristic you suggested for dropping the earlier paths might help here, but it would be required for correctness, not merely as an optimization.
>>> 
>>> I'll need to think about this one some more. My initial reaction is that if a PCH loads a module in one configuration and a user of the PCH tries to load the same module in a different configuration (because it's being built with different include paths), that should be a hard error. It seems to me that this is a separate issue from ensuring that we don't have collisions in the cache?
>> 
>> This patch does make it an error if the user of the PCH does not provide a header search context in which  the modules resolve the same way as when the pch was built.  I don’t think expecting the PCH to be created with exactly the same -I options is valuable, because it prevents legitimate reuse.
>> 
>> OK. I think that's great as far as it goes, but it doesn't solve the whole problem, because non-modular #includes can still resolve in different ways.
>> 
>>> I am also not sure what effect this would have on the global module index, since the set of modules to load would span multiple hash-directories and we would need to avoid looking at incompatible modules.  I’m not sure if this is a problem or not.
>>> 
>>> Yes, that's an interesting problem =)
>>>  
>>> Any thoughts?
>>> 
>>> I don't think I've got enough information to suggest something concrete yet, but this strikes me as a possible avenue to explore:
>>> 
>>> 1) Teach module.map files to specify the include paths for the module
>>> 2) When implicitly generating a module.map file, inject the include paths from the host build (probably skipping ones prior to the directory containing the framework)
>>> 3) Include a hash of the contents of the implicit module.map file in the cache entry for any module that has an implicit module.map
>>> 4) Don't include the header search paths in the configuration hash, since they're not used when determining how to build a module any more.
>> 
>> Does this help us with multiple modules having the same name?  I think we would still have the problem that when we go to load a pcm file it may have been built from a different module.map file than the one we found in this compilation (e.g. I have a module MyFoo, and so does another user on the system and we happen to share a module cache directory).  According to 4 we don’t hash the -I paths, so these modules would (potentially) resolve to the same location on disk.
>> 
>> This is convincing me more than ever that  regardless of what else we do, we need to store the path to the module map file with the pcm so that we can verify that the module we are trying to import resolves to the same location that it did when it was built. To me, the filename of the pcm is the most obvious place for this information, since that allows us to restrict ourselves to one module cache directory (ModuleCache/<module hash>) just like now, and it allows modules with the same name to live in the same cache without trampling each other or the possibility of accidentally loading the wrong one.
>> 
>> I'd like to take a step back at this point and summarize what I believe the state to be (sorry that this is rather long). We have three pieces of data that between them uniquely identify the contents of a module:
> 
> Thanks for the summary :)
> 
>> 
>> Build environment configuration:
>> --------------------------------
>> 
>>   This covers things like the language options (and, currently, configuration macros, though those probably belong in the module identity)
>>   We have one global module index per configuration, and we would like each compilation action to only touch a single configuration.
>>   We want to reuse the same build environment configuration as much as possible, to maximize cache hits.
>>   A hash of this data is used as the subdirectory in the module cache.
>> 
>> Module identity:
>> ----------------
>> 
>>   This covers things like the name of the module and the module.map file in which it was defined.
>>   The proposed patch includes a hash of this data in the name of the .pcm file.
>> 
>> Module header:
>> --------------
>> 
>>   This covers things that we discovered when building the module, such as the module's dependencies and their timestamps.
>>   The proposed patch adds the list of imported modules to this, such that header search must find the same module.
>>   If we have a mismatch here, we rebuild the module.
>> 
>> So far, this division seems entirely reasonable to me. Observations:
>>  * We only want to use one global index per build
>>  * We want to minimize the amount of irrelevant stuff in the global index for efficiency reasons (otherwise we could just have one global index for all configurations), and
>>  * We want to minimize the number of times we rebuild modules
>> 
>> From the above, I think that we should use the following guidelines:
>>  * For correctness, the (build environment + module identity + module header) must uniquely identify the module contents (because that's how we determine whether we have a cache hit).
>>  * We want to pick the data that goes into the build environment such that two module builds have the same build environment configuration if they could conceivably be used in the same TU, and we should try to keep their build environments different otherwise.
>>  * We only want to include things in the module header if changes to those things mean we'll probably never want to use the old .pcm file again.
> 
> As long as the list of imported modules lives in the module header, I don’t think this last guideline makes sense.  For example, suppose I have a library A that depends on library B.  I also have two users of library A, and they each require a different version of B.  Or like the case above, I am the developer of B and want to test against A using my modified version.
> 
> What I'm suggesting is that we should strive to put that data somewhere other than the module header, since putting it in the module header will cause unnecessary module rebuilds. If the set of module imports changes, the optimal course of action depends on *why* it changes: if it's due to a filesystem difference, we should rebuild the module (since the old pcm is probably not useful any more), but if it's due to an include path change, we should probably keep the existing .pcm file, since we might want to use it again.
> 
> But that's only the optimal course of action, and we could certainly do something locally suboptimal here if it makes other factors better.
> 
>> The main point of contention seems to be how header search paths fit into the above taxonomy. I think we've discussed a few options here:
>> 
>> 1) They are a property of the module, and are independent of the header search path of the module user. This means they are not part of the above data.
>> 
>> 2) They are a property of the build environment. (This was my original position.) This seems to mean that the module identity can be just the module name, if we enforce that the same module name can't be found multiple times in the include path.
> 
> There is a hole here that Argyrios pointed out in an earlier email.  You can add a new module map file to a path already in the search, and that may change the module identity without changing its name or configuration.  Now if we are enforcing that the module only be found once, you would presumably have to remove/modify the module map it was originally found in.  But to be 100% correct we would still need to include the module map file in the module identity.
> 
> Yes, I agree, and I think you've fully convinced me that we should re-resolve of imported modules on module load (even if the include path is unchanged).
> 
> In fact, you've convinced me that all the changes your patch makes are going in the right direction, so consider this an LGTM on the direction of the patch. Thanks for your perseverance! I'd still like to keep hashing out the details here, because I think we've not yet reached consensus on whether your patch is a complete solution or just a partial one.

Great :) Do you also agree with using the module header to store and then re-resolve the imports or is that still contentious?

> 
>> 3) They are a property of the module identity.
>> 
>> 4) They're not part of anything. The names and resolved paths of imported modules are included in the module header. We hope that the include path didn't affect anything else. (This is what the patch we're discussing does.)
>> 
>> Option 4 seems to be only a partial solution to the include path issue, because we still have cache collisions between modules with different contents due to differing -I paths. Your suggested flag to make non-modular #includes inside a module an error would fix this, but I don't think that is going to work for much real-world code -- I don't think that non-modular #includes are going away.
> 
> I agree this won’t work for all code, but if the standard library is modularized I think it does cover a lot of cases.
> 
>> 
>> I think we can fix option 4 by also including the set of files #included by non-modular includes in the module header (and anything else that depends on the include paths), and re-resolving those files too when we import a module. If the lookup differs, we rebuild the module. This has the unfortunate effect that we might rebuild a module when the prior .pcm is still useful, but that seems to be a rare case. I could probably live with that. Does that sound OK to you?
> 
> When we discussed this internally, we felt that it would be expensive (potentially *lots* of header search lookups), and also hairy to implement (you need not only the #include spelling, but also the “includes” context - maybe more?).
> 
> OK, but if we don't do it, then we can get bogus hits in the module cache. Perhaps those are rare enough that we don't need to worry, but they don't seem fundamentally different from the cache hits you're trying to avoid by re-resolving modules.

Right, I’m not trying to solve this with my patch.  Maybe re-resolving the non-modular includes is the best we can do for modules that don’t have their own search context and will not abide by the “it’s modules all the way down” rule.

> 
>> Another question is what include paths we should use for a module:
>> 
>> 1) They inherit all the include paths from the host build. (We keep the legacy broken model.)
>> 2) They inherit the include paths for themselves and anything later in the header search path.
>> 3) They get to specify their own include path.
>> 
>> (1) is the status quo. I suggest we allow (3) and fall back to (2) if no include path is specified.
> 
> 
> So what exactly does (3) entail? Does it provide the *entire* search context, even for resolving dependent modules?  If so, we are making the module map dependent on the system it is installed on.  If not, then we still need to resolve module names during build to ensure they haven’t changed, regardless of the mechanism we use to find them.
> 
> I think a module should be able to isolate itself from the header search configuration of its users entirely, so I'd expect this to be the entire search context. Another thing we've been discussing internally is the ability for module map files to contain references to other module map files (so library X can say "I depend on module Y, whose module map is over there"), which largely removes the need for include paths.
> 
> I’m also worried that (2) is so close to (1) that it will subtly break users’ expectations.
> 
> I think that's a fair concern. 

-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.llvm.org/pipermail/cfe-commits/attachments/20140401/be0ae6f0/attachment.html>