[PATCH] D83023: [lldb/ObjectFileMachO] Fetch shared cache images from our own shared cache

Fri Jul 3 01:03:19 PDT 2020

labath added a comment.

In D83023#2128475 <https://reviews.llvm.org/D83023#2128475>, @friss wrote:

> In D83023#2128298 <https://reviews.llvm.org/D83023#2128298>, @labath wrote:
>
> > I think this is a very interesting feature (lldb being able to load modules from memory; the mac shared cache thingy is interesting too, but in a different way). We have a feature request from people who are downloading modules from a network (from a proprietary symbol server, etc.) and would like to pass them to lldb without having to serialize them to disk. This would be a step towards making that happen. It could also be useful for our own unit tests which now have to do a similar thing.
> >
> > However, I think this could use some clean up. There's a lot of juggling of data/file/object offsets going on, and it all seems inconsistent and avoidable to me. Please see inline comments for details.
>
>
> I'll see what can be done. My main goal while working on this was to avoid changing the semantics outside of the shared cache usecase. I understand fairly well the codepath that I added and then just moved some other bits around to keep the existing semantics for the rest. Happy to rework this.

So, if an object file needs to access some data which is outside of "its" image then my idea about using sliced data buffers will probably not work. It that case, using the "object offset" field to communicate the location of the "object" might not be a bad idea (it's still different than the use in .a files, but maybe we can stretch that definition). The part that bugs me then is having this functionality key off of the "data" field being set. Ideally, these would be two orthogonal features:

- the "data" would control whether you read the file system to obtain the object contents
- the "object offset" would tell you where to locate the desired object inside these "jumbo" objects

I think that might be doable by adding one more argument to the ObjectFile::GetModuleSpecifications interface, which says "this buffer contains a lot of stuff, but the object I am really interested in is at <offset>". Then ObjectFileMachO could do what it needs to read the object from that offset, but it would still have access to the entire buffer to read the __LINKEDIT thingy.

Or.... we could try to do something completely different, and avoid touching the common interfaces altogether. The way I see it, this shared cache thing is unlikely to be reusable for anything else, and the reason we're adding these interfaces is so that we can communicate information from DynamicLoaderDarwin to ObjectFileMachO through `Target::GetOrCreateModule`. But that is silly, because DynamicLoaderDarwin already knows that the request is going to go to ObjectFileMachO (and indeed things would break if it doesn't). If there was a way for these two to communicate directly, then all of this would be unneeded, because DynamicLoaderDarwin could use a ObjectFileMachO interface, which would be specially crafted for this purpose.

It also seems to me we are only really interested in the "Get" part of `Target::GetOrCreateModule` -- i.e., ensuring we don't create multiple module instances for the same object. The rest of the function deals with various ways to try to find a module, but that's the thing we actually want to avoid, as we already know the data buffer that contains it. And all of the checks about matching uuids/architecures/etc. don't seem useful either as DynamicLoaderDarwin is repeating most of these anyway.

We actually already have an interface that almost does this: `Module::CreateModuleFromObjectFile`. It allows one to create a Module while directly specifying the ObjectFile class to use, and passing arbitrary constructor arguments. The part it does *not* do is register this module into the global module cache. That's because so far, this function is used to create "weird" modules that we wouldn't want to cache anyway. But, what if we created a version of this function which does that? Could `DynamicLoaderDarwin` then instead of `target.GetOrCreateModule(shared_cache_spec)` do something like:

  module_sp = target.GetImages().FindFirstModule(shared_cache_spec);
  // If target already contains the module, we're done.
  if (!module_sp) {
    // This checks the global module cache for a matching module. If it finds one, it returns it. Otherwise, it creates a module through CreateModuleFromObjectFile, and registers it into the cache.
    module_sp = Module::GetOrCreateModuleFromObjectFile<ObjectFileMachO>(shared_cache_spec, whatever_args_we_need_to_get_macho_to_load_properly);
    target.GetImages().Append(module_sp, /*notify=*/false);
  }

I think an interface like this could be useful in the future as an escape hatch, because there are a lot of situations where one already knows he is dealing with a specific kind of object files, but is not able to take advantage of that. For example, DynamicLoaderPOSIXDYLD "knows" it's going to load an ELF file, that knowledge does not help it in any way. So far, we haven't needed to do anything super weird there, but that class already does contain a bunch of fallbacks/workarounds for bugs in various dynamic loaders and other platform features (e.g. the VDSO pseudo-module). I can certainly see how being able to create an module more directly could be helpful at times.

> 
> 
>> The patch is also quite light on testing. If done right, I believe this should make it possible to yaml2obj a file into memory in a unit test and then create a Module from that. That would enable us to test the Module(Spec) changes in isolation, and move them to a separate patch.
> 
> I agree about the light testing. FWIW, this got significant living-on testing on Apple platforms, but some more targeted testing would be great.

Yeah, I'm sure it works now. I'm more worried about it continuing to work in face of other random changes. :)

================
Comment at: lldb/include/lldb/Utility/DataBuffer.h:82

+class DataBufferHostMemory : public DataBuffer {
+public:
----------------
friss wrote:
> labath wrote:
> > All of our other DataBuffers also point to host memory (how could they not do that?). I guess what really makes this one special is that it does not own the storage that it points to...
> Good point. What about `DataBufferUnowned` ? (and adding a comment to explain what it's used for)
Sounds good. I don't think that the comment really needs to mention the shared cache. I can see this being useful in other circumstances too...

================
Comment at: lldb/source/Core/Module.cpp:154-159
+  // If the ModuleSpec provided a DataBuffer, let's respect the ModuleSpec's
+  // file offset when reading in this buffer.
+  if (data_sp) {
+    file_offset = module_spec.GetObjectOffset();
+    file_size = module_spec.GetData()->GetByteSize();
+  }
----------------
friss wrote:
> labath wrote:
> > I think this overloads the meaning of `module_spec.GetObjectOffset()` in a fairly confusing way. Normally, I believe `ModuleSpec::GetObject{Name,Offset,Size}` is used to refer to the name/offse/size of a object file located in an archive (.a). However, here it seems you are reusing it for something else. It seems unfortunate that the meaning of this field should change depending on the "data" field being present.
> > 
> > What exactly is the purpose of this field? Could we avoid this by just creating an appropriately-sized DataBuffer ?
> So the shared cache is some kind of a container in some ways similar to a `.a`, file, in some ways very different. The main difference is that the contained dylibs are not each in their own subpart of the file. For example, the __LINKEDIT segment is shared by all of them. The load commands that describe the images are relative to the shared cache itself, not to one specific image.
> 
> So I reused the object_offset field in its "offset from the beginning of a container sense". 
> 
> I think I tried having the SharedCacheInfo return only the subpart of the cache that contains the image, and ran into issues with this. But I don't remember exactly what, and I have a much better understanding now, so I should try it again.
Hmm... interesting. I'm going to touch on this more in the main comment.

Repository:
  rG LLVM Github Monorepo

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D83023/new/

https://reviews.llvm.org/D83023