[llvm-dev] RFC: Adding llvm::ThinStream

Peter Collingbourne via llvm-dev llvm-dev at lists.llvm.org
Wed Feb 22 12:28:50 PST 2017


On Sat, Feb 18, 2017 at 5:09 PM, Zachary Turner via llvm-dev <
llvm-dev at lists.llvm.org> wrote:

> Some background:
>
> A while back, while working on code to read / write PDB files, I came up
> with Yet Another Stream Abstraction.  Note that LLVM already has a few.
> Off the top of my head, there's:
>
> 1) `MemoryBuffer` and its associated class hierarchy
> 2) `raw_ostream` and its associated classes.
> 3) `DataExtractor` which is used for reading from a StringRef.
>
> There are probably more, and individual subprojects might have even created
> their own.
>
> The reason I couldn't use any of these and needed to invent another is
> that PDB files are not laid out contiguously in memory.  You can think
> of it as a file system where there is an MFT that defines the blocks that
> individual files live on, and then you have to re-sequence the blocks in
> order to read out a "file".
>
> To add to the complexity, each "file" inside of here is actually a list of
> variable length records, where records, or even individual fields of records,
> might cross a block boundary and require multiple reads to piece together.
> I needed a way to view these streams as contiguous and to read data out of
> them, with all the magic of broken fields and records, discontiguous
> records, etc. just disappearing behind the interface.  Also, I
> needed the ability to read and write using a single interface without
> worrying about the details of the block structure.  None of LLVM's existing
> abstractions provided convenient mechanisms for writing this kind of data.
>
> So I came up with the following set of abstractions:
>
> *ThinStream* - An abstract base class that provides read-only access to
> data.  The interface is:
>   virtual Error readBytes(uint32_t Offset, uint32_t Size,
>                           ArrayRef<uint8_t> &Buffer) const = 0;
>   virtual Error readLongestContiguousChunk(uint32_t Offset,
>                                            ArrayRef<uint8_t> &Buffer) const = 0;
>   virtual uint32_t getLength() const = 0;
>
> An important distinction between `ThinStream` and existing stream
> implementations is that the API encourages implementations to be *zero copy*.
> Instead of giving it a buffer to write into, it just returns you a slice of
> the existing buffer.  This makes reads *very efficient*, since no bytes are
> copied.  Similar to
> `ArrayRef` / `MutableArrayRef`, I also provide *WritableThinStream* for
> cases where your data is not read-only.  This is another area where
> functionality is provided that was not present in existing abstractions
> (i.e. writeability).
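
For illustration, the read-only interface and a zero-copy implementation over
a contiguous buffer might be sketched as below.  This is a standalone
approximation, not the code in the PDB library: `ByteSlice` is a hypothetical
stand-in for `ArrayRef<uint8_t>`, `bool` stands in for `llvm::Error`, and the
`*Sketch` class names are invented for this example.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical stand-in for llvm::ArrayRef<uint8_t>: a non-owning view.
struct ByteSlice {
  const uint8_t *Data;
  uint32_t Size;
};

// Sketch of the read-only interface; bool (true = success) replaces
// llvm::Error to keep the example self-contained.
class ThinStreamSketch {
public:
  virtual ~ThinStreamSketch() = default;
  virtual bool readBytes(uint32_t Offset, uint32_t Size,
                         ByteSlice &Buffer) const = 0;
  virtual bool readLongestContiguousChunk(uint32_t Offset,
                                          ByteSlice &Buffer) const = 0;
  virtual uint32_t getLength() const = 0;
};

// Contiguous backing store: every successful read hands back a pointer into
// the caller's data, so nothing is copied.
class ByteStreamSketch : public ThinStreamSketch {
  ByteSlice Data;

public:
  explicit ByteStreamSketch(ByteSlice D) : Data(D) {}

  bool readBytes(uint32_t Offset, uint32_t Size,
                 ByteSlice &Buffer) const override {
    if (Offset > Data.Size || Size > Data.Size - Offset)
      return false; // request runs past the end of the stream
    Buffer = ByteSlice{Data.Data + Offset, Size};
    return true;
  }

  bool readLongestContiguousChunk(uint32_t Offset,
                                  ByteSlice &Buffer) const override {
    if (Offset > Data.Size)
      return false;
    // The whole store is contiguous, so the chunk is everything that is left.
    Buffer = ByteSlice{Data.Data + Offset, Data.Size - Offset};
    return true;
  }

  uint32_t getLength() const override { return Data.Size; }
};
```

The caller passes in an empty slice and gets back a view whose pointer lies
inside the original buffer, which is the property the RFC calls zero copy.
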
>
> I have several implementations of this class and some abstractions for
> working with them.  *ByteStream* and *MutableByteStream* provide a
> concrete implementation where the backing store is an `ArrayRef` or
> `MutableArrayRef`.  *MappedBlockStream* (which is PDB specific) provides
> an implementation that seeks around a file, piecing together blocks from
> various locations in a PDB file.  When a call to `readBytes` spans a block
> boundary, it does multiple reads, allocates a contiguous buffer from a
> `BumpPtrAllocator` and returns a reference to that (subsequent requests for
> the same offset return from the cached allocation).  There is also
> *FileBufferThinStream* which adapts an `llvm::FileOutputBuffer` so you
> can write to a file system object.  One could easily imagine an
> implementation that adapts `llvm::MemoryBuffer` so that you could read and
> write from mmap'ed files.
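
The block-spanning behaviour described for `MappedBlockStream` can be
approximated with the sketch below.  It is a simplification under stated
assumptions: `MappedBlocksSketch` is a hypothetical name, a `std::deque` of
vectors stands in for the `BumpPtrAllocator`, the cache is keyed on
(offset, size) rather than on offset alone, and empty reads are rejected to
keep the bounds logic short.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <deque>
#include <map>
#include <utility>
#include <vector>

class MappedBlocksSketch {
  const uint8_t *File;          // whole file image, assumed to outlive us
  uint32_t BlockSize;
  std::vector<uint32_t> Blocks; // file block numbers, in stream order
  uint32_t Length;              // stream length; assumed <= Blocks.size() * BlockSize
  // Stand-ins for the allocator and the cache of stitched-together reads.
  mutable std::deque<std::vector<uint8_t>> Pool;
  mutable std::map<std::pair<uint32_t, uint32_t>, const uint8_t *> Cache;

  const uint8_t *blockPtr(uint32_t StreamPos) const {
    return File + size_t(Blocks[StreamPos / BlockSize]) * BlockSize +
           StreamPos % BlockSize;
  }

public:
  MappedBlocksSketch(const uint8_t *File, uint32_t BlockSize,
                     std::vector<uint32_t> Blocks, uint32_t Length)
      : File(File), BlockSize(BlockSize), Blocks(std::move(Blocks)),
        Length(Length) {}

  // Returns Size contiguous bytes at stream Offset, or nullptr if the request
  // is empty or out of range.
  const uint8_t *readBytes(uint32_t Offset, uint32_t Size) const {
    if (Size == 0 || Offset >= Length || Size > Length - Offset)
      return nullptr;
    // Fast path: the request stays inside one block, so no copy is needed.
    if (Size <= BlockSize - Offset % BlockSize)
      return blockPtr(Offset);
    // Slow path: reuse a previous stitched copy if there is one.
    auto Key = std::make_pair(Offset, Size);
    auto It = Cache.find(Key);
    if (It != Cache.end())
      return It->second;
    // Otherwise copy block by block into a buffer with a stable address.
    Pool.emplace_back(Size);
    std::vector<uint8_t> &Buf = Pool.back();
    uint32_t Done = 0;
    while (Done < Size) {
      uint32_t Pos = Offset + Done;
      uint32_t Chunk = std::min(BlockSize - Pos % BlockSize, Size - Done);
      std::memcpy(Buf.data() + Done, blockPtr(Pos), Chunk);
      Done += Chunk;
    }
    Cache[Key] = Buf.data();
    return Buf.data();
  }
};
```

Only reads that cross a block boundary pay for a copy, and each such copy is
made once; everything else is a pointer into the file image.
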
>
> But all of these just allow reading and writing raw bytes.  To handle
> reading and writing semantic data, there are two additional classes:
> *ThinStreamReader* and *ThinStreamWriter*.
>
> These accept any subclass of `ThinStream` and maintain an offset and allow
> you to read integers in any endianness, strings, objects, arrays of objects
> (both fixed length records and variable length records), and various other
> things.
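
As a rough picture of what "maintain an offset and read integers in any
endianness" means, here is a standalone sketch.  `ReaderSketch`, `Endian` and
`readInteger` as written here are hypothetical names over a raw byte pointer,
not the `ThinStreamReader` API, and only unsigned integers are handled.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <type_traits>

enum class Endian { Little, Big };

class ReaderSketch {
  const uint8_t *Data;
  uint32_t Length;
  uint32_t Offset = 0; // invariant: Offset <= Length

public:
  ReaderSketch(const uint8_t *D, uint32_t L) : Data(D), Length(L) {}

  // Reads one unsigned integer at the current offset and advances past it.
  // The value is assembled byte by byte, so the result does not depend on
  // the host's byte order.
  template <typename T> bool readInteger(T &Dest, Endian E) {
    static_assert(std::is_unsigned<T>::value, "sketch handles unsigned only");
    if (sizeof(T) > Length - Offset)
      return false; // not enough bytes left; offset is unchanged
    T Value = 0;
    for (size_t I = 0; I < sizeof(T); ++I) {
      size_t Shift = 8 * (E == Endian::Little ? I : sizeof(T) - 1 - I);
      Value |= static_cast<T>(static_cast<T>(Data[Offset + I]) << Shift);
    }
    Dest = Value;
    Offset += sizeof(T);
    return true;
  }

  uint32_t bytesRemaining() const { return Length - Offset; }
};
```

Successive calls consume the stream in order, and a failed read leaves the
offset where it was so the caller can report the error position.
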
>
> Finally, there are *ThinStreamRef* and *WritableThinStreamRef* which you
> can think of as `ArrayRef` and `MutableArrayRef` for ThinStreams.  They
> allow slicing, dropping, etc and copy-semantics so that you can easily pass
> streams around without worrying about reference lifetime issues.
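
One way to picture the slicing and copy semantics is the sketch below.  It
assumes shared ownership of a contiguous store as the mechanism for avoiding
lifetime problems; the RFC does not say how the real classes achieve this, and
`StreamRefSketch` and its members are hypothetical names.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <memory>
#include <utility>
#include <vector>

// A cheap-to-copy view: a shared handle to the store plus a window into it.
class StreamRefSketch {
  std::shared_ptr<const std::vector<uint8_t>> Store;
  uint32_t ViewOffset;
  uint32_t Length;

public:
  explicit StreamRefSketch(std::shared_ptr<const std::vector<uint8_t>> S)
      : Store(std::move(S)), ViewOffset(0),
        Length(static_cast<uint32_t>(Store->size())) {}

  uint32_t getLength() const { return Length; }
  const uint8_t *data() const { return Store->data() + ViewOffset; }

  // Each operation returns a new view and leaves this one untouched.
  StreamRefSketch drop_front(uint32_t N) const {
    N = std::min(N, Length); // clamp instead of failing, for brevity
    StreamRefSketch R = *this;
    R.ViewOffset += N;
    R.Length -= N;
    return R;
  }

  StreamRefSketch keep_front(uint32_t N) const {
    StreamRefSketch R = *this;
    R.Length = std::min(N, Length);
    return R;
  }

  StreamRefSketch slice(uint32_t Off, uint32_t Len) const {
    return drop_front(Off).keep_front(Len);
  }
};
```

Copying a view copies two integers and a handle, never the bytes, so views
can be passed by value in the same way `ArrayRef` is.
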
>
> To re-iterate: *When using a ThinStreamReader, copies are the exception,
> not the rule, and for most implementations of ThinStream they will never
> occur.*
> Suppose you have a struct:
>
> struct Header {
>   char Magic[48];
>   ulittle16_t A;
>   ulittle16_t B;
>   ulittle64_t C;
> };
>
> To read this using a `ThinStreamReader`, you would write this:
>
> ThinStreamReader Reader(Stream);
> const Header *H;
> if (auto EC = Reader.readObject(H))
>   return EC;
>
> and `ThinStreamReader` just reinterpret_casts the underlying bytes to your
> structure.  The same is true for null terminated strings, arrays of
> objects, and everything else.  *It is up to the user to ensure that reads
> and writes happen at proper alignments.*  (LLVM can still assert though
> if you do a misaligned read/write).  But when reading and writing records
> in binary file formats, this is not a new issue.
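
The cast-in-place read, together with the alignment caveat, could look
roughly like this.  `readObjectSketch` is a hypothetical free function over a
raw byte pointer, not the `ThinStreamReader::readObject` member, and it
returns `false` on a misaligned read where the RFC suggests LLVM could
assert.

```cpp
#include <cassert>
#include <cstdint>
#include <type_traits>

// Points Dest at the bytes at Offset, reinterpreted as a T, and advances
// Offset.  No bytes are copied.  Fails, leaving Offset unchanged, if the
// object would run past the end or the address is not aligned for T.
template <typename T>
bool readObjectSketch(const uint8_t *Data, uint32_t Length, uint32_t &Offset,
                      const T *&Dest) {
  static_assert(std::is_trivially_copyable<T>::value,
                "only plain record types can be viewed in place");
  if (Offset > Length || sizeof(T) > Length - Offset)
    return false; // out of bounds
  const uint8_t *P = Data + Offset;
  if (reinterpret_cast<uintptr_t>(P) % alignof(T) != 0)
    return false; // misaligned: the caller chose a bad offset
  Dest = reinterpret_cast<const T *>(P);
  Offset += sizeof(T);
  return true;
}

// Example record type for the usage below (hypothetical).
struct RecSketch {
  uint32_t A;
  uint32_t B;
};
```

Record types built from byte-array wrappers such as `ulittle16_t` have an
alignment of 1, so for them the alignment check can never fail; it matters
for records with natively aligned fields like `RecSketch`.
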
>
> The proposal:
> This code has been used in LLVM's PDB library for some time, and I want to
> move it up to Support.
>
> I've discussed this with some people offline.  beanz@ has expressed
> interest in using it for some work he wants to do on libobject.
>

Out of curiosity, how do we expect this to be useful in libobject?

Peter