[PATCH] LTO: introduce object file-based on-disk module format.

Fri Jul 4 15:55:33 PDT 2014

On 05/07/2014 01:04, Rafael Espíndola wrote:
>> The code now uses parseBitcodeFile instead of getLazyBitcodeModule, which
>> as far as I can tell causes the reference to the memory buffer in the
>> BitcodeReader to be released.
> I see. So,  the area is a bit of  can of worms. Lets expands a bit:
>
> * ld64 needs to read the bodies early (see bellow).
> * gold does not.
> * llvm-ar does not.
>
> It is also extremely handy to layer LTOModule on top of IRObjectFile
> to avoid code duplication and make sure the symbol resolution in the
> linker stays consistent with what symbols are placed in the archive
> symbol table.
>
> If we end up using IRObjectFile directly for the "first pass" in the
> gold plugin, then it is OK to keep reading the entire bitcode file in
> LTOModule. Otherwise we have to factor that so that we can avoid doing
> that from the gold plugin.  I *think* that using IRObjectFile for the
> first pass makes sense, but I will have to try it and benchmark it a
> bit to be sure.
>
> Constructing MemoryBuffer from StringRef is also a bit odd. One option
> would be to try to make Binary and the IR reader never own the buffer
> and just take StringRef arguments. The only real case I know where
> that can be a problem is the C interface, but if that is really the
> only place we can create a wrapper that is only ever used for C.

After Peter pointed out that problem I investigated a way to solve that 
with MemoryBuffers.

The attached patch adds a simple new facility that you can use like this:

+  std::unique_ptr<MemoryBuffer> secBuffer;
     if (secName == ".llvmbc") {
       StringRef secContents;
       if (std::error_code ec = sec.getContents(secContents)) {
         if (errMsg)
           *errMsg = ec.message();
         return StringRef();
       }
-      return secBuffer;
+      secBuffer = objectfile.createDataReference(secContents);
     }
   }

secBuffer persists the MemoryBuffer resource so it can outlive the 
ObjectFile if needed without reading/opening the file multiple times. 
It's built on top of the new MemoryBuffer::getMemBufferSlice(), also in 
the patch.

I think that once you get past the initial shock of reference counting, 
this makes a great deal of sense and is the only option that offers 
eventual consistency with the centralised file cache manager we 
ultimately want.

Patch attached.

Alp.


>
>>> BTW, the materializeAllPermanently is a very unfortunate but necessary
>>> thing for the C api. The reason it is there is the way ld64 implements
>>> an optimization:
>>>
>>> * We want to avoid putting linkonce_odr symbols in the symbol table
>>> unless their address is significant.
>>> * The way ld64 implements this is being told for every symbol if it
>>> can be dropped from the symbol table.
>>> * To tell ld64 about that we need to look at every use of a symbol.
>>> * To look at every use of a symbol we have to materialize all function
>>> bodies since they may contain an use.
>> I see. I wonder if it would be possible to improve on this by storing def-use
>> information, maybe in the bitcode or in another section of the object file,
>> and only materializing the entities that are needed.
> Maybe, but I think the way forward is to use an api closer to what the
> gold plugin provides. Instead of the linker just giving us a boolean
> (keep this symbol), it says why (used from object, used in symbol
> table, etc) and the plugin can decide to omit it in some cases. This
> lets us delay reading the function bodies until after symbol
> resolution.
>
>>> Two higher level questions about this approach:
>>>
>>> * It is effectively a way of doing fat object files. GCC does
>>> something similar. One small but recurring annoyance I had when
>>> working with GCC LTO is that it can end up falling back silently to
>>> non-LTO since tho outer container is ELF, with IR in a section. With
>>> LLVM right now we get an error, since without our help the linker is
>>> clueless as to what LLVM IR is.
>> I can imagine that this could be a problem if all symbol definitions are always
>> duplicated in the object file, as GCC does (?). But if the symbol definitions
>> in the object file are normally a strict subset of those that would be emitted
>> without LTO (trivially true in the case where the object file contains no
>> allocatable sections, or less trivially if we try to choose functions to emit
>> directly into the object file which we think would not benefit from LTO),
>> we should normally detect this with undefined symbol errors at link time.
>>
>>> Can't you invert this? Put the ELF in
>>> the IR (or even represent .go_export directly in some metadata) and
>>> extent the gold plugin API to handle this?
>> I decided not to use IR as the outer layer because I think that it would be
>> useful for external tools to be able to understand the contents of metadata
>> without needing to understand the ever changing bitcode format.
>>
>> In the specific case of Go, there exist libraries and tools such as [1]
>> which understand ELF and can read the metadata in the object files, and I'd
>> prefer not to have to teach those tools about bitcode.
> BTW, gcc has both thin and fat files and a GO frontend. Do you know
> what do they do about the GO metadata when building with thin files in
> LTO?
>
> How is this represented when targeting OS X or windows? Do they still use ELF?
>
>>> Another advantage is that
>>> it should be easier to support fat COFF and MachO objects too.
>> The patch is already testing that COFF and Mach-O work. Or did you mean
>> something else?
> No, I had missed that it was looking for a ".llvmbc" section in all 3.
>
>>> * If we do need to go with IR in ELF instead of ELF in IR, we also
>>> need to update lib/Object to handle it. We still want to get an
>>> IRObjectFile (by default at least) when given one of those files so
>>> that llvm-nm/llvm-ar do the right thing.
>> Maybe all that is needed is a function that "interprets" SymbolicFiles,
>> i.e. that takes a SymbolicFile and returns back another SymbolicFile (which
>> would be an IRObjectFile in the IR-in-ELF case). This looks like it would be
>> sufficient for llvm-nm/llvm-ar. Later I guess we could extend this function
>> to return multiple SymbolicFiles if we want to pursue the IR-in-object ==
>> IR+object idea.
> Maybe. It would still probably be nicer if the default construction
> functions (like createBinary) would find out the IR in section and use
> that.
>
> Cheers,
> Rafael
> _______________________________________________
> llvm-commits mailing list
> llvm-commits at cs.uiuc.edu
> http://lists.cs.uiuc.edu/mailman/listinfo/llvm-commits

-- 
http://www.nuanti.com
the browser experts

-------------- next part --------------

diff --git a/include/llvm/Object/ObjectFile.h b/include/llvm/Object/ObjectFile.h
index 646abf8..5a889dd 100644
--- a/include/llvm/Object/ObjectFile.h
+++ b/include/llvm/Object/ObjectFile.h
@@ -249,6 +249,12 @@ protected:
                                          uint64_t &Res) const = 0;
   virtual std::error_code getSectionContents(DataRefImpl Sec,
                                              StringRef &Res) const = 0;
+  virtual MemoryBuffer *createDataReference(StringRef Str) const {
+    if (!Data || !Data->containsRegion(Str))
+      return MemoryBuffer::getMemBufferCopy(Str);
+    return MemoryBuffer::getMemBufferSlice(Data.get(), Str.size(),
+                                           Str.data() - Data->getBufferStart());
+  }
   virtual std::error_code getSectionAlignment(DataRefImpl Sec,
                                               uint64_t &Res) const = 0;
   virtual std::error_code isSectionText(DataRefImpl Sec, bool &Res) const = 0;
diff --git a/include/llvm/Support/MemoryBuffer.h b/include/llvm/Support/MemoryBuffer.h
index a520bc9..a62c944 100644
--- a/include/llvm/Support/MemoryBuffer.h
+++ b/include/llvm/Support/MemoryBuffer.h
@@ -60,6 +60,11 @@ public:
     return StringRef(BufferStart, getBufferSize());
   }
 
+  bool containsRegion(StringRef Data) const {
+    return Data.data() >= getBufferStart() &&
+           Data.data() + Data.size() <= getBufferEnd();
+  }
+
   /// getBufferIdentifier - Return an identifier for this buffer, typically the
   /// filename it was read from.
   virtual const char *getBufferIdentifier() const {
@@ -116,6 +121,10 @@ public:
   static MemoryBuffer *getMemBufferCopy(StringRef InputData,
                                         StringRef BufferName = "");
 
+  static MemoryBuffer *
+  getMemBufferSlice(IntrusiveRefCntPtr<MemoryBuffer> Parent, uint64_t Size,
+                    int64_t Offset);
+
   /// getNewMemBuffer - Allocate a new MemoryBuffer of the specified size that
   /// is completely initialized to zeros.  Note that the caller should
   /// initialize the memory allocated by this method.  The memory is owned by
diff --git a/lib/Support/MemoryBuffer.cpp b/lib/Support/MemoryBuffer.cpp
index abf4f60..6251127 100644
--- a/lib/Support/MemoryBuffer.cpp
+++ b/lib/Support/MemoryBuffer.cpp
@@ -418,6 +418,41 @@ std::error_code MemoryBuffer::getOpenFileSlice(
 }
 
 //===----------------------------------------------------------------------===//
+// MemoryBuffer::getMemBufferSlice implementation.
+//===----------------------------------------------------------------------===//
+
+namespace {
+/// \brief Holds a reference to part of another memory buffer.
+class SubMemoryBuffer : public MemoryBuffer {
+  const IntrusiveRefCntPtr<MemoryBuffer> Parent;
+
+public:
+  SubMemoryBuffer(IntrusiveRefCntPtr<MemoryBuffer> Parent, uint64_t Len,
+                  uint64_t Offset)
+      : Parent(Parent) {
+    const char *Start = Parent->getBufferStart() + Offset;
+    const char *End = Start + Len;
+    assert(Start <= Parent->getBufferEnd());
+    assert(End >= Parent->getBufferStart());
+    init(Start, End, false);
+  }
+
+  const char *getBufferIdentifier() const override {
+    return Parent->getBufferIdentifier();
+  }
+
+  BufferKind getBufferKind() const override { return Parent->getBufferKind(); }
+};
+}
+
+MemoryBuffer *
+MemoryBuffer::getMemBufferSlice(IntrusiveRefCntPtr<MemoryBuffer> Parent,
+                                uint64_t Size, int64_t Offset) {
+  assert(Parent);
+  return new SubMemoryBuffer(Parent, Size, Offset);
+}
+
+//===----------------------------------------------------------------------===//
 // MemoryBuffer::getSTDIN implementation.
 //===----------------------------------------------------------------------===//