[llvm-dev] Refactoring. Using streams for llvm-objcopy.

Mon Jan 18 09:13:28 PST 2021

Folks,

We are trying to reuse parts of llvm-objcopy. To make that possible, we want
to move the main implementation of llvm-objcopy from the "tools" subdirectory
into the "Object" library (D88827). One of the problems is that llvm-objcopy
uses a custom buffer class (tools/llvm-objcopy/Buffer.h/cpp) as its output
buffer:

Error executeObjcopyOnRawBinary(const CopyConfig &Config, MemoryBuffer &In,
                                ***Buffer &Out***);

class Buffer {
   StringRef Name;

public:
   virtual Error allocate(size_t Size) = 0;
   virtual uint8_t *getBufferStart() = 0;
   virtual Error commit() = 0;
};

There are two drawbacks to using the custom Buffer class:

1. It is not good to move a custom Buffer class into the general llvm
     Object library. It is better to use a standard, already existing
     solution.

2. The interface of class Buffer assumes that the entire buffer is
     preallocated, i.e. before writing to the output file we need to
     allocate all of the space. Such pre-allocation is not a problem if
     memory-mapped files are used behind Buffer, but it could waste memory
     in other cases. A library might be used in a wider range of scenarios
     than a standalone tool, so it would not be good for the library to
     work efficiently only when memory-mapped files are used.
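
To make the pre-allocation point concrete, here is a minimal sketch. It is
not the actual llvm-objcopy code: BufferSketch, VectorBuffer and writePayload
are hypothetical names, and Error is reduced to bool so that the snippet
builds without LLVM headers.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <string>
#include <vector>

// Hypothetical, simplified stand-in for the Buffer interface shown above.
class BufferSketch {
public:
  virtual ~BufferSketch() = default;
  virtual bool allocate(std::size_t Size) = 0;
  virtual std::uint8_t *getBufferStart() = 0;
  virtual bool commit() = 0;
};

// In-memory implementation: the whole output lives in RAM until commit().
class VectorBuffer : public BufferSketch {
  std::vector<std::uint8_t> Storage;
  bool Committed = false;

public:
  bool allocate(std::size_t Size) override {
    Storage.assign(Size, 0); // The full-size allocation happens up front.
    return true;
  }
  std::uint8_t *getBufferStart() override { return Storage.data(); }
  bool commit() override {
    Committed = true;
    return true;
  }
  std::size_t size() const { return Storage.size(); }
  bool committed() const { return Committed; }
};

// The writer must know the final size before it can emit a single byte.
inline bool writePayload(BufferSketch &Out, const std::string &Payload) {
  if (!Out.allocate(Payload.size()))
    return false;
  if (!Payload.empty())
    std::memcpy(Out.getBufferStart(), Payload.data(), Payload.size());
  return Out.commit();
}
```

With a file-backed mapping behind allocate() the up-front size is cheap; with
a heap-backed implementation like the one above, the whole output is held in
memory until commit().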

=======================================

We propose to use streams instead of the custom Buffer (D91028):

Error executeObjcopyOnRawBinary(const CopyConfig &Config, MemoryBuffer &In,
                                ***raw_ostream &Out***);

That solution has the following benefits:

1. It uses standard llvm streams.
2. It does not require pre-allocating the entire space.
3. It makes it easy to switch between kinds of destination (raw_fd_ostream,
     raw_null_ostream, raw_svector_ostream, raw_sha1_ostream,
     raw_string_ostream).
4. For some usages it could avoid memory allocation altogether (using
     raw_sha1_ostream as a destination for SHA calculation would not require
     allocating space for the output file).
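
As a rough illustration of points 3 and 4, the sketch below uses std::ostream
as a stand-in for raw_ostream so that it builds without LLVM. HashingBuf,
writeObject and fnv1a are hypothetical names, and FNV-1a is used only to keep
the example short; it is not SHA-1.

```cpp
#include <cassert>
#include <cstdint>
#include <ostream>
#include <sstream>
#include <streambuf>
#include <string>

// Sink that hashes bytes as they arrive and stores none of them.
class HashingBuf : public std::streambuf {
  std::uint64_t Hash = 14695981039346656037ULL; // FNV-1a 64-bit offset basis.

  void mix(unsigned char C) {
    Hash ^= C;
    Hash *= 1099511628211ULL; // FNV-1a 64-bit prime.
  }

protected:
  // Called for single characters, since this buffer has no put area.
  int_type overflow(int_type Ch) override {
    if (!traits_type::eq_int_type(Ch, traits_type::eof()))
      mix(static_cast<unsigned char>(traits_type::to_char_type(Ch)));
    return traits_type::not_eof(Ch);
  }
  // Called for block writes such as std::ostream::write().
  std::streamsize xsputn(const char *S, std::streamsize N) override {
    for (std::streamsize I = 0; I < N; ++I)
      mix(static_cast<unsigned char>(S[I]));
    return N;
  }

public:
  std::uint64_t hash() const { return Hash; }
};

// Hypothetical writer: it only sees an output stream, not the destination.
inline void writeObject(std::ostream &Out, const std::string &Payload) {
  Out.write("HDR", 3);
  Out.write(Payload.data(), static_cast<std::streamsize>(Payload.size()));
}

// Reference FNV-1a over a complete string, used to check the hashing sink.
inline std::uint64_t fnv1a(const std::string &S) {
  std::uint64_t H = 14695981039346656037ULL;
  for (unsigned char C : S) {
    H ^= C;
    H *= 1099511628211ULL;
  }
  return H;
}
```

The same writeObject() call can target a std::ostringstream (output kept in
memory) or a std::ostream built on HashingBuf (output never stored); only the
caller decides which.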

That solution has the following drawbacks:

1. There is no memory-mapped file implementation for streams.
2. Some formats cannot be generated in one pass.
     E.g. the ELF writer needs to go back: after the section header table
     is generated, it needs to go back and update the ELF header.

For the first point, we might create such an implementation
(raw_mmap_stream).

For the second point, it looks like we have three alternatives:

a) The current implementation of the ELF writer already has preliminary
      steps that calculate sizes. Before allocating the destination buffer,
      it calculates the size of the resulting binary:
      ELFWriter<ELFT>::finalize().
      So it looks like all required ELF header information might be
      precalculated during this finalizing step.
      That would allow writing data to the output stream in one pass.

b) Use raw_pwrite_stream as the output. It would allow seeking and
      updating.

c) Use an internal memory buffer: generate the file into that memory buffer
      (a memory buffer allows going back and updating) and then stream that
      buffer into the output.
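
A small sketch of the difference between a) and b), using a toy format (a
4-byte little-endian header holding the offset of a trailer) rather than ELF.
std::ostream with tellp()/seekp() stands in for raw_pwrite_stream here; le32,
writeSinglePass and writeWithPatch are hypothetical names.

```cpp
#include <cassert>
#include <cstdint>
#include <ostream>
#include <sstream>
#include <string>

// Encode a 32-bit value as 4 little-endian bytes.
inline std::string le32(std::uint32_t V) {
  std::string B(4, '\0');
  for (int I = 0; I < 4; ++I)
    B[I] = static_cast<char>((V >> (8 * I)) & 0xFF);
  return B;
}

inline void put(std::ostream &Out, const std::string &S) {
  Out.write(S.data(), static_cast<std::streamsize>(S.size()));
}

// Alternative a): compute the trailer offset first, then write in one pass.
inline void writeSinglePass(std::ostream &Out, const std::string &Payload,
                            const std::string &Trailer) {
  const auto TrailerOffset = static_cast<std::uint32_t>(4 + Payload.size());
  put(Out, le32(TrailerOffset));
  put(Out, Payload);
  put(Out, Trailer);
}

// Alternative b): write a placeholder header, then seek back and patch it.
// This only works when the destination supports seeking.
inline void writeWithPatch(std::ostream &Out, const std::string &Payload,
                           const std::string &Trailer) {
  const std::ostream::pos_type Start = Out.tellp();
  put(Out, le32(0)); // Placeholder for the trailer offset.
  put(Out, Payload);
  const auto TrailerOffset = static_cast<std::uint32_t>(Out.tellp() - Start);
  put(Out, Trailer);
  const std::ostream::pos_type End = Out.tellp();
  Out.seekp(Start);
  put(Out, le32(TrailerOffset)); // Overwrite the placeholder in place.
  Out.seekp(End);                // Restore the position for later writes.
}
```

Both functions produce the same bytes; a) needs the layout to be known before
the first write, b) needs a seekable destination.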

=======================================

D91028 suggests the following roadmap to replace Buffer with streams:

1. Implement interfaces using raw_ostream:

    Error executeObjcopyOnBinary(CopyConfig &Config, object::Binary &In,
                                 raw_ostream &Out);

2. Use additional internal buffers for file generation and do not change
     the writers' implementation.

     After the files are generated, stream the buffers into the
     output (raw_ostream &Out).

    Error executeObjcopyOnBinary(CopyConfig &Config, object::Binary &In,
                                 raw_ostream &Out) {
      MemoryBuffer internal;
      ELFWriter.write(internal);
      Out.write(internal.data(), internal.size());
    }

3. Change the implementation of the writers (ELF/COFF/MachO/Wasm) to not
     use internal buffers, so that the writers store data into the output
     stream directly.

    Error executeObjcopyOnBinary(CopyConfig &Config, object::Binary &In,
                                 raw_ostream &Out) {
      ELFWriter.write(Out);
    }

     If all implementations are successful, leave raw_ostream in the
     interfaces.

     If some implementations still require seek/update functionality,
     change raw_ostream to raw_pwrite_stream:

     Error executeObjcopyOnBinary(CopyConfig &Config, object::Binary &In,
                                  raw_pwrite_stream &Out);
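
To illustrate why step 2 is a safe intermediate state, here is a minimal
sketch in which the output cannot be revisited at all. std::ostream again
stands in for raw_ostream; AppendOnlyBuf, executeBuffered and executeDirect
are hypothetical names, and the "format" is just a one-byte size field
followed by the payload.

```cpp
#include <cassert>
#include <cstddef>
#include <ostream>
#include <streambuf>
#include <string>

// Append-only sink: it inherits std::streambuf's default seekoff/seekpos,
// which report failure, so bytes already written can never be revisited.
class AppendOnlyBuf : public std::streambuf {
  std::string Bytes;

protected:
  int_type overflow(int_type Ch) override {
    if (!traits_type::eq_int_type(Ch, traits_type::eof()))
      Bytes.push_back(traits_type::to_char_type(Ch));
    return traits_type::not_eof(Ch);
  }
  std::streamsize xsputn(const char *S, std::streamsize N) override {
    Bytes.append(S, static_cast<std::size_t>(N));
    return N;
  }

public:
  const std::string &bytes() const { return Bytes; }
};

// Step 2 shape: build the file in an internal buffer, patch the size field
// there, then stream the finished bytes to Out sequentially.
inline void executeBuffered(const std::string &Payload, std::ostream &Out) {
  std::string Internal(1, '\0'); // Placeholder size byte.
  Internal += Payload;
  Internal[0] = static_cast<char>(Payload.size() & 0xFF); // Patch in memory.
  Out.write(Internal.data(), static_cast<std::streamsize>(Internal.size()));
}

// Step 3 shape: same signature, but the size is computed up front, so the
// bytes go straight to Out with no internal buffer.
inline void executeDirect(const std::string &Payload, std::ostream &Out) {
  Out.put(static_cast<char>(Payload.size() & 0xFF));
  Out.write(Payload.data(), static_cast<std::streamsize>(Payload.size()));
}
```

Callers see the same signature and the same bytes in both cases, so moving a
writer from the step 2 shape to the step 3 shape would not affect them.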

=======================================

So, what do you think? Would it be good to use streams as the output
interface for the objcopy code in the Object library?

Or do we need some other solution here?

Thank you, Alexey.