[PATCH] D109972: Regenerate LC_CODE_SIGNATURE during llvm-objcopy operations

Daniel Rodríguez Troitiño via Phabricator via llvm-commits llvm-commits at lists.llvm.org
Tue Sep 21 14:14:24 PDT 2021


drodriguez added a comment.

@steven_wu: Thanks for the feedback. I think at this point we are completely conviced that `libObject` was not the right place. We are trying to find a good place, and we would like, as much as possible to not duplicate the same code or logic in two tools. It is not a lot of code right now, but the potential divergence in the future scares us a little.

The usecase for `llvm-objcopy` is that when trying to use `llvm-strip` on binaries linked by `ld64` or `lld` the signature is currently copied verbatim, which makes the binary invalid in Apple Silicon machines. The changes here recalculates the signature in order to make the binary valid again. This seems to be what happens when using Xcode's own `strip` (which is not based on LLVM), and it is something that we wanted to replicate in order to keep being compatible. I am not sure about what you refer to with the "bincompat problem", but if you elaborate, I can try to figure out if there will be a problem that we haven't had into account.



================
Comment at: llvm/tools/llvm-objcopy/MachO/MachOWriter.cpp:441
+    }
+    object::CodeSignatureSection SignatureBuilder =
+        object::CodeSignatureSection(
----------------
steven_wu wrote:
> drodriguez wrote:
> > alexander-shaposhnikov wrote:
> > > drodriguez wrote:
> > > > alexander-shaposhnikov wrote:
> > > > > drodriguez wrote:
> > > > > > alexander-shaposhnikov wrote:
> > > > > > > It looks like I've missed the previous diff https://reviews.llvm.org/D109803, sorry about being late.
> > > > > > > 
> > > > > > > 1/ MachO.h is probably not the ideal place for this functionality, according to the top file comment "MachO.h declares the MachOObjectFile class". My first impression is that the public declaration probably should be placed into a separate file, e.g. "MachOCodeSignature.h" (and we need a detailed comment explaining what this class or function does)
> > > > > > > 
> > > > > > > 2/ The responsibility of the class CodeSignatureSection seems to be somewhat unclear. Perhaps, we don't really need a class here at all. What would you say to just creating a function, e.g. std::string buildCodeSignatureStab(...) or void writeCodeSignatureStab(...) with some clear separation of the input / output and removing the unrelated pieces of functionality (e.g. msync on the buffer, this is probably not suitable for a library function) ? (and not expose the implementation details in the header file).
> > > > > > > 
> > > > > > > 3/ Some minor comments, e.g. stripOutputFilePath - I suspect Suport/Path.h might already have some helper utilities
> > > > > > > 
> > > > > > > What I would recommend here - can we (temporarily) take a step back and  reopen & revert D109803, if it were not a library interface I would not be so worried, but for some common and frequently used libraries like libObject I would highly suggest we should clean up the interface first and I'm more than happy to help.
> > > > > > > 
> > > > > > > 
> > > > > > > 
> > > > > > Hi Alex,
> > > > > > 
> > > > > > We would prefer to fix forward instead of reverting, if possible. As it sits, D109803 haven't changed anything and it is just "dormant" code as far as LLVM respects. Only this diff uses that functionality.
> > > > > > 
> > > > > > I will review your comment next week more carefully and provide better answers.
> > > > > > 
> > > > > > 1/ I think we can do this. It should not be a problem.
> > > > > > 2/ The class tries to imitate the existing code in LLD, which was a class. There's two related functionalities to the class: measuring the size, and serializing. For both one needs the same inputs, so it was interesting to tie data and behaviour together.
> > > > > > 2b/ `msync` on the buffer is actually important, and `llvm-objcopy` might not work unless we make that call. IIRC without that, if one was writing the same file, and that file was mmap'ed into memory, the kernel will not realize that the signature had changed, and will not evaluate again. I can look for a reference that explained the problem in a little bit more detail.
> > > > > > 3/ We realize that it was not portable for Windows, and Nuri changed that this afternoon to use `llvm::path` instead. This diff has the fix, since the previous one was already landed.
> > > > > > 
> > > > > > Thanks for the help offering. Any feedback is welcome. We will try to get to the best version of this that satisfies everyone's requirements.
> > > > > Hi! Many thanks!
> > > > > My 0.02$ . The current design looks a little bit suboptimal, especially for a library, so I suggest let's improve the interface first and put things into the right place. It's a bit hard to post comments outside of the code context, but yeah, I'm late to the party. The LLD-specific details ("CodeSignatureSection") can still live in the LLD codebase and make use of the function from libObject.
> > > > > If there are no significant performance differences I would be biased towards readability / simplicity, e.g. one can consider something like this (I didn't verify the details, so maybe some parameters are unnecessary or, instead, missing)
> > > > > ```
> > > > > SmallVector<char, 64> buildCodeSignatureStab(
> > > > >     uint64_t SignatureOffset,
> > > > >     MachO::HeaderFileType FileType, StringRef FilePath,
> > > > >     uint64_t TextSegmentOffset, StringRef TextSegmentContent)
> > > > > 
> > > > > ```
> > > > > and in LLD the method CodeSignatureSection::finalize would call this function - would that be sufficient ? (maybe I'm missing something, feel free to correct me). (Plus in this case we will avoid having two different  CodeSignatureSection classes)
> > > > > and (mostly motivated by the "separation of concerns") I'd leave the unfortunate workaround with msync on the application's side, hopefully it'll be gone in the future.
> > > > > 
> > > > > 
> > > > About the ideal place for this code:
> > > > 
> > > > We thought about libObject and specifically MachO.h because it seemed like the best place, but we are fine relocating the code if necessary. We have also seen libBinaryFormat which includes some support for MachO and which might be a good place for these. The basic structs for the code signature (`CS_CodeDirectory`) is there, so it might be a good idea to keep everything together. Our only requirement is that it should be accessible to both llvm-objcopy and LLD to avoid code duplication.
> > > > 
> > > > These are not LLD specific details, as far as we understand, but part of how MachO binaries are put together and every tool that creates a MachO binary will need to play by these rules (they apply to each slice independently, so in our testing Universal MachO does not need changes, but there might be edge cases we are not aware).
> > > > 
> > > > About the shape of the code:
> > > > 
> > > > We don’t mind removing the class and having a couple of free functions instead, but there is the possibility of the responsibilities of the class growing bigger in the future, and we would like to avoid functions with a lot of parameters (there’s already 4 that both the serialization and the calculation of the size would need to receive).
> > > > 
> > > > About your proposal of `buildCodeSignatureStab`, is there an example of that pattern already in the code. We have looked for similar constructs and we cannot find it. Also, being “Stab” a term of art related to dSYM, maybe it is not the best name to use (alternatives can be something like “Blob”, or use “serialize” instead of “build”). Also, using `SmallVector` might be limited for this usage since the actual signature depends on the size of the binary, and for each 4 KB of binary we need an extra 32 bytes. I have seen signatures being 15 KB for LLVM binaries. We would prefer the versions that write directly into memory to avoid excessive copy of data.
> > > > 
> > > > Finally, about `msync`:
> > > > 
> > > > While we can move the `msync` calls into the application code, and add a note in the documentation of the method that the caller must ensure that `msync` gets invoked, we think it is better to keep the `msync` only in one place with a clear explanation of why it is necessary.
> > > > 
> > > > The details are in https://openradar.appspot.com/FB8914231 but in short, when mmaping an executable, the kernel seems to cache the results of the signature verification and might not check for the validity of the binary again, even if the signature gets regenerated. This is very typical while iterating on a binary, and creating the binary over and over, so it is something that can happen in both LLD and llvm-objcopy (specially in llvm-strip).
> > > > 
> > > > Apple might fix their side eventually, but until then we need the workaround everywhere, so we think a central place is better.
> > > > 
> > > 
> > > 
> > > 1.
> > > >is there an example of that pattern already in the code. We have looked for similar constructs and we cannot find it.
> > > 
> > > If I'm not mistaken libObject was initially designed to present a read-only view of object files and closely models the binary formats, so, in general, what was added in https://reviews.llvm.org/D109803 does not have many analogs or counterparts there (to the best of my knowledge), although libObject contains some functionality to write archives and universal binaries.
> > > 
> > > I find the interface introduced in https://reviews.llvm.org/D109803 quite confusing and still think it would be better to revisit that decision (and revert for now) before it has propagated further. This can be subjective so I'm happy if somebody else familiar with libObject weighs in, maybe I'm missing something.
> > > 
> > > 2.
> > > If I am not mistaken this code doesn't use the signer's identity in any way so it's more like a collection of hashes.
> > > That's why above I called it a "stab". I think it also indicates that we need a detailed comment (in the source code) that should clarify these aspects.
> > > 
> > > 3.
> > > >These are not LLD specific details, as far as we understand, but part of how MachO binaries
> > > LLD uses the term "section" to model various parts of the  LINKEDIT segment, however, the binary format does not use this term (to the best of my knowledge) for this purpose. The class  object::CodeSignatureSection looks very similar to what was initially implemented in LLD but I'm not sure it's a good fit for libObject, this is what I was referring to as "LLD-specific details" above.
> > > 
> > > Subscribing more people who are familiar with libObject & Mach-O:
> > > cc @mtrent , @lattner, @steven_wu 
> > Seems fair that `libObject` is mainly for reading files. In `libBinaryFormat` is were the some of the struct used by `LC_CODE_SIGNATURE` are defined. There are small functions in there for MachO (mostly getters, but some setters). Would that be a good place? It also has a `MsgPackWriter`, which seems mostly unrelated, but seems to show that "writers" are OK in that library.
> > 
> > About being confusing, we are fine changing the couple of methods into free functions if that allow us to advance further. We chose a class because it was already that shape in LLD, and it made sense to keep that design. Would that be OK for everyone? It would be one function for measuring size, and one function to write the headers of the data contained in `LC_CODE_SIGNATURE` and calculate the appropriate hashes.
> > 
> > About the name "code signing": you are completely right that this does not use the signer identity at all and it is just a bunch of hashes. I think it should be understood more as a binary "signature", than a cryptographic signature of the code. It is mostly an HMAC, but without the cryptographic pieces. From what I understand it is just a way for the linker to output "verified" binaries in Apple Silicon Macs without having to setup signing identities and stuff like that.
> > 
> > We will add more comments to the code to try to explain better the purpose and which functionality is implemented.
> > 
> > About the term "section": Yes, LLD implements the linkedit segment with synthetic sections, and that name did slip, and I did not realize what you were referring to. These are not what MachO understands as sections. We will change that and probably make it clear that it is related to the `CS_CodeDirectory`, `CS_SuperBlob` and `CS_BlobIndex`.
> > 
> > In summary, if everyone agrees:
> > 
> > - Move the functionality outside `libObject` and probably into `libBinaryFormat`, where the struct `CS_CodeDirectory` is already living.
> > - Transform the functionality into a couple of free functions that receive all the necessary information.
> > - Do not mention "Section" as related to any of this functionality. Also try to not mention "code signature" that much, if a better name is available.
> > - Also the feedback provided by James Henderson above.
> > 
> > We hope the plan sounds good to everyone. Let us know if someone has more feedback that we can integrate.
> I don't think `libBinaryFormat` is a good place to implement write.
> 
> I think a good design should be following:
> * libObject (CodeSignatureSection) knows how to read the section and generations section with data
> * CodeSignatureSection can return generated section data in a buffer if it has to.
> * Binary write should be implemented inside tools (like llvm-objcopy, lld or yaml2obj). That is not really a code duplication.
@steven_wu: Thanks for the input. Let me try to see if we understand each other correctly.

- Neither LLD, nor llvm-objcopy are not really interested in "reading" the contents of the data from the `LC_CODE_SIGNATURE`. I would not like to design an API that has no users. If we needed to read from the section, and after Alexander explanations above, we now know that the right place is `libObject`, and we will add the code there, if necessary.
- To generate the data one needs access to the fully layout binary (slice). The process involves hashing every page of the binary (4 KB) and appending a hash of the contents. The data "header" can be written without access to the binary contents, but one still needs some information (mainly about offsets).
- The binary write that we have tried to abstract from the tools still require the tools to provide a buffer where the data is written (and a buffer where the binary contents are read, to generate the hashes). It was a happy coincidence that both tools worked with buffers, but if at some point in the future the writting mechanism have to be abstracted away, it should still be possible. I would not like to have the same process in each of the tools, if possible, since there is tricky pieces that would be bad to have in two parts of the code base (this was the main reason of bringing the LLD implementation into LLVM: we didn't want to simply duplicate the code).
- We are fine moving the `msync` out of this and into the tools, if that's necessary to make everyone happy. We do not think it is the right decision, but we are fine making that change.


Repository:
  rG LLVM Github Monorepo

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D109972/new/

https://reviews.llvm.org/D109972



More information about the llvm-commits mailing list