[PATCH] D114841: [lld-macho] Fix duplicate symbols with relocatable objects

Mon Jan 31 17:12:18 PST 2022

keith added inline comments.

================
Comment at: lld/MachO/Driver.cpp:322-324
   case file_magic::macho_object:
-    newFile = make<ObjFile>(mbref, getModTime(path), "");
+    newFile = loadedObjects[path] = make<ObjFile>(mbref, getModTime(path), "");
     break;
----------------
oontvoo wrote:
> int3 wrote:
> > keith wrote:
> > > int3 wrote:
> > > > oontvoo wrote:
> > > > > smeenai wrote:
> > > > > > oontvoo wrote:
> > > > > > > int3 wrote:
> > > > > > > > oontvoo wrote:
> > > > > > > > > int3 wrote:
> > > > > > > > > > keith wrote:
> > > > > > > > > > > int3 wrote:
> > > > > > > > > > > > oontvoo wrote:
> > > > > > > > > > > > > keith wrote:
> > > > > > > > > > > > > > oontvoo wrote:
> > > > > > > > > > > > > > > On a second thought, this is suspicious. Doesn't this mean `%lld foo.o foo.o ` wouldn't yield "duplicate symbols" error anymore?
> > > > > > > > > > > > > > Yea I mentioned this with a top level comment as well, it does break the related test. I need to find a better heuristic for skipping these, since we only want to do it for LC_LINKER_OPTION's / duplicate -framework commands etc (not sure if the latter is handled today or not, especially as it conflicts with -framework foo -weak_framework foo etc).
> > > > > > > > > > > > > > 
> > > > > > > > > > > > > > I was hoping to do this specifically in the LC_LINKER_OPTION logic to avoid changing things here, but the issue is the order of hitting those options vs the command line options depends on the order things are passed, so I couldn't rely on that.
> > > > > > > > > > > > > Ah, I've missed the previous comments. Sorry!
> > > > > > > > > > > > > 
> > > > > > > > > > > > > In any case, I can think of a contrived example where you have the input obj files (standalone or in archives) containing no symbols but contributing to global ctors. In that case, wouldn't we want the side effect of these being loaded individually? 
> > > > > > > > > > > > > 
> > > > > > > > > > > > > 
> > > > > > > > > > > > I don't think it'd be terribly hard to handle this, we'd just need an extra boolean to `addFile` to indicate if a file is being loaded via `add{Framework,Library}` or if it's being loaded directly via a path. That said... I'm not sure there's much value in handling this edge case. I'm more than happy to just dedup everything and document this behavioral difference in ld64-vs-lld.rst.
> > > > > > > > > > > > 
> > > > > > > > > > > > > I can think of a contrived example where you have the input obj files (standalone or in archives) containing no symbols but contributing to global ctors
> > > > > > > > > > > > 
> > > > > > > > > > > > What does 'contributing to global ctors' mean exactly?
> > > > > > > > > > > FWIW tensorflow is how I discovered this. They produce an object file in a framework and you get corresponding linker options loading it if you include it from swift. So our app doesn't link in that case. On the swift side there is a private flag to exclude the linker option, but that's not ideal
> > > > > > > > > > > I'm not sure there's much value in handling this edge case
> > > > > > > > > > 
> > > > > > > > > > Just to be clear, I meant there's no much value in raising duplicate symbol errors when an object file is specified twice on the command line. We should definitely handle object files in frameworks that get loaded multiple times
> > > > > > > > > > Just to be clear, I meant there's no much value in raising duplicate symbol errors when an object file is specified twice on the command line. 
> > > > > > > > > 
> > > > > > > > > Ok, agreed there's no value in raising duplicate error (that's just the side effect of not loading the file again)
> > > > > > > > > 
> > > > > > > > > > What does 'contributing to global ctors' mean exactly?
> > > > > > > > > 
> > > > > > > > > The "contrived" case I'm talking about is this:
> > > > > > > > > 
> > > > > > > > > ```
> > > > > > > > > 
> > > > > > > > > // main.c
> > > > > > > > > #include <stdio.h>
> > > > > > > > > void a_ctor() {
> > > > > > > > >   // do stuff
> > > > > > > > >   printf("ctor called\n");
> > > > > > > > > }
> > > > > > > > > 
> > > > > > > > > int main() {return 0; }
> > > > > > > > > 
> > > > > > > > > // my_init.s
> > > > > > > > > // This file defines no symbol - but it sticks the _a_ctor into the global ctors list
> > > > > > > > > L_.str:  
> > > > > > > > > 	.section	__DATA,__mod_init_func,mod_init_funcs
> > > > > > > > > 	.p2align	3
> > > > > > > > > 	.quad	_a_ctor
> > > > > > > > > ```
> > > > > > > > > 
> > > > > > > > > Link it ` ld <other args> init.o init.o main.o`
> > > > > > > > > 
> > > > > > > > > Run it:
> > > > > > > > > 
> > > > > > > > > ```
> > > > > > > > > $ ./my_main.out 
> > > > > > > > > ctor called
> > > > > > > > > ctor called
> > > > > > > > > ```
> > > > > > > > > 
> > > > > > > > > With this patch, it'll only execute the ctor once
> > > > > > > > > 
> > > > > > > > > 
> > > > > > > > oh right I see what you mean. yeah seems very edge-case-y...
> > > > > > > In normal applications, this might be an edge case, yes, but it's not uncommon to see all kinds of creative usages of custom init (eg., the sanitizers do it all the time).
> > > > > > > 
> > > > > > > To think of it another way, if we were to propose this change to other ports (eg ELF), do you think they'd accept it? 
> > > > > > > If yes, then ok 🤐 
> > > > > > > If no, then why would MachO do it?  (it's a weird behaviour ... and tbh, I don't think they'd accept it )
> > > > > > > 
> > > > > > > But I've said my piece and it seems your position has the popular vote here :P so ...
> > > > > > > 
> > > > > > > ("this change" specifically being "if you see multiple identical input paths then, only look at one of them")
> > > > > > I believe LLD COFF already deduplicates input files (but I also think they're just emulating Microsoft's linker's behavior there).
> > > > > That's fair.
> > > > > Feel free to revert to the simpler version. :)
> > > > @oontvoo I think you make good points, but I also think that the rare builds relying on this can easily restructure themselves to not have to load the same file multiple times.
> > > > 
> > > > Moreover, LLD-ELF already deviates from bfd in several ways, and I'm not sure this additional discrepancy would be that egregious.
> > > Sorry to resurrect this one, but I think with my current implementation of tracking implicit vs explicit, we get the best of both worlds here, I tested the ctors example above and it gets called twice as expected, but duplicate symbols still get reported as expected as well. Let me know if you have any concerns with it at this point though!
> > Hm, I think the current implementation looks a bit complicated and may have subtle bugs. For instance, what happens if a library is loaded implicitly first, and then subsequently is loaded explicitly twice in a row? I *think* LLD would end up loading it only once with this implementation...
> > 
> > A possibly simpler implementation would be to hoist `DylibFile::explicitlyLinked` into the parent `InputFile` class, then combine the two `loaded*Objects` hashmaps into one.
> > 
> > Personally though I'd prefer a simpler always-dedup implementation -- there'd be fewer potential bugs to avoid :)
> While I'd prefer the LD64's behaviour, I think I agree with int3 that simple implementation here is better. 
> 
> So my vote would have been:
> (a) ignore this patch (ie., preserve the non-dedup)
> (b) revert this patch to the earlier revision (without all the complexity to support the use case I'd bought up).
> 
> (a) is clearly not a good idea, hence 👍 for (b)
> 
I just pushed a different approach here. I let us get a bit off track because of how this was implemented, as far as I know at this point this is only a problem with frameworks, and the only reason I had implemented it here was because this is where we already had the info to know if it was one of these cases. I just pushed hoisting this out into addFramework itself instead, which reads the magic there and makes this decision and caching. This shouldn't have any performance impact on reading twice since the first time we read the file we cache it in memory anyways, so in the case we don't do anything, and immediately call addFile, it should still be fine. Let me know what you think!

Repository:
  rG LLVM Github Monorepo

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D114841/new/

https://reviews.llvm.org/D114841