[llvm-dev] RFC: Adding llvm::ThinStream

Wed Feb 22 13:15:08 PST 2017

I'll let Chris answer this one since he was the one who mentioned it to
me, but there's one more thing I forgot to mention.

It's still up in the PDB library, but I also have a class called
CodeViewRecordIO which is similar in spirit to YamlIO.  You initialize it
with either a BinaryStreamReader or BinaryStreamWriter, and then instead of
calling functions like readInteger, writeInteger, readObject, or
readZeroString, you call mapInteger, mapObject, mapZeroString.  If it's
initialized with a reader, it reads, otherwise it writes.

This turned out to be extremely useful as it allowed me to merge the
reading and writing codepaths into one codepath.  We had bugs before where
we couldn't round-trip a PDB because we would write one thing and read
another thing (maybe we forgot to skip some padding or something on either
the read or the write).  When there's one codepath, it almost becomes
declarative.  You just write what order the fields come in, and then both
reading and writing work automatically.
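
To make the "one codepath" idea concrete, here is a minimal sketch. It is not
the actual CodeViewRecordIO code: `RecordIO`, `ExampleRecord`, `mapAlign`, and
`mapRecord` are hypothetical names, and a `std::vector<uint8_t>` stands in for
the BinaryStreamReader / BinaryStreamWriter pair. Integers are assumed to be
little-endian.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>
#include <type_traits>
#include <vector>

// Hypothetical stand-in for CodeViewRecordIO: the same map* calls either
// append to a buffer (writer) or consume from one (reader).
class RecordIO {
public:
  static RecordIO writer(std::vector<uint8_t> &Out) { return RecordIO(nullptr, &Out); }
  static RecordIO reader(const std::vector<uint8_t> &In) { return RecordIO(&In, nullptr); }

  bool isWriting() const { return Out != nullptr; }

  // Maps an unsigned integer as little-endian bytes.
  template <typename T> bool mapInteger(T &Value) {
    static_assert(std::is_unsigned<T>::value, "sketch handles unsigned types only");
    if (isWriting()) {
      for (size_t I = 0; I < sizeof(T); ++I)
        Out->push_back(static_cast<uint8_t>(static_cast<uint64_t>(Value) >> (8 * I)));
      return true;
    }
    if (In->size() - Offset < sizeof(T))
      return false; // not enough bytes left
    uint64_t Result = 0;
    for (size_t I = 0; I < sizeof(T); ++I)
      Result |= static_cast<uint64_t>((*In)[Offset + I]) << (8 * I);
    Value = static_cast<T>(Result);
    Offset += sizeof(T);
    return true;
  }

  // Maps a null-terminated string.
  bool mapZeroString(std::string &S) {
    if (isWriting()) {
      Out->insert(Out->end(), S.begin(), S.end());
      Out->push_back(0);
      return true;
    }
    size_t End = Offset;
    while (End < In->size() && (*In)[End] != 0)
      ++End;
    if (End == In->size())
      return false; // no terminator found
    S.assign(In->begin() + Offset, In->begin() + End);
    Offset = End + 1;
    return true;
  }

  // Pads with zeros when writing, skips the padding when reading.
  bool mapAlign(size_t Align) {
    assert(Align != 0 && "alignment must be non-zero");
    if (isWriting()) {
      while (Out->size() % Align != 0)
        Out->push_back(0);
      return true;
    }
    size_t Aligned = (Offset + Align - 1) / Align * Align;
    if (Aligned > In->size())
      return false;
    Offset = Aligned;
    return true;
  }

private:
  RecordIO(const std::vector<uint8_t> *In, std::vector<uint8_t> *Out)
      : In(In), Out(Out), Offset(0) {}
  const std::vector<uint8_t> *In; // non-null in reading mode
  std::vector<uint8_t> *Out;      // non-null in writing mode
  size_t Offset;                  // read position (reading mode only)
};

struct ExampleRecord {
  uint16_t Kind;
  std::string Name;
  uint32_t Size;
};

// The field order is stated once; reading and writing both follow it.
bool mapRecord(RecordIO &IO, ExampleRecord &R) {
  return IO.mapInteger(R.Kind) && IO.mapZeroString(R.Name) &&
         IO.mapAlign(4) && IO.mapInteger(R.Size);
}
```

Because the padding step lives in `mapRecord` itself, the reader and writer
cannot disagree about it, which is the class of round-trip bug described above.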

It's not ready to be generalized alongside this quite yet, but that would
be the next logical step.

On Wed, Feb 22, 2017 at 12:28 PM Peter Collingbourne <peter at pcc.me.uk>
wrote:

> On Sat, Feb 18, 2017 at 5:09 PM, Zachary Turner via llvm-dev <
> llvm-dev at lists.llvm.org> wrote:
>
> Some background:
>
> A while back while working on code to read / write PDB files, I came up
> with Yet Another Stream Abstraction.  Note that LLVM already has a few.
> Off the top of my head, there's:
>
> 1) `MemoryBuffer` and its associated class hierarchy
> 2) `raw_ostream` and its associated classes.
> 3) `DataExtractor` which is used for reading from a StringRef.
>
> There's probably more, and individual subprojects might have even created
> their own.
>
> The reason I couldn't use any of these and needed to invent another is
> because PDB files are not laid out contiguously in memory.  You can think
> of it as a file system where there is an MFT that defines the blocks that
> individual files live on, and then you have to re-sequence the blocks in
> order to read out a "file".
>
> To add to the complexity, each "file" inside of here is actually a list of
> variable length records where records, or even individual fields of records
> might cross a block boundary and require multiple reads to piece together.
> I needed a way to view these streams as being contiguous, and reading data
> out of them allowing all the magic of broken fields and records,
> discontiguous records, etc to just disappear behind the interface.  Also, I
> needed the ability to read and write using a single interface without
> worrying about the details of the block structure.  None of LLVM's existing
> abstractions provided convenient mechanisms for writing this kind of data.
>
> So I came up with the following set of abstractions:
>
> *ThinStream* - An abstract base class that provides read-only access to
> data.  The interface is:
>   virtual Error readBytes(uint32_t Offset, uint32_t Size,
>                           ArrayRef<uint8_t> &Buffer) const = 0;
>   virtual Error readLongestContiguousChunk(uint32_t Offset,
>                                            ArrayRef<uint8_t> &Buffer) const = 0;
>   virtual uint32_t getLength() const = 0;
>
> An important distinction between `ThinStream` and existing stream
> implementations is that the API encourages implementations to be *zero copy*.
> Instead of giving it a buffer to write into, it just returns you a slice of
> the existing buffer.  This makes it *very efficient*.  Similar to
> `ArrayRef` / `MutableArrayRef`, I also provide *WritableThinStream* for
> cases where your data is not read-only.  This is another area where
> functionality is provided that was not present in existing abstractions
> (i.e. writeability).
>
> I have several implementations of this class and some abstractions for
> working with them.  *ByteStream* and *MutableByteStream* provide a
> concrete implementation where the backing store is an `ArrayRef` or
> `MutableArrayRef`.  *MappedBlockStream* (which is PDB specific) provides
> an implementation that seeks around a file, piecing together blocks from
> various locations in a PDB file.  When a call to `readBytes` spans a block
> boundary, it does multiple reads, allocates a contiguous buffer from a
> `BumpPtrAllocator` and returns a reference to that (subsequent requests for
> the same offset return from the cached allocation).  There is also
> *FileBufferThinStream* which adapts an `llvm::FileOutputBuffer` so you
> can write to a file system object.  One could easily imagine an
> implementation that adapts `llvm::MemoryBuffer` so that you could read and
> write from mmap'ed files.
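
The block-stitching behaviour described for MappedBlockStream can be sketched
roughly as follows. This is an illustration under stated assumptions, not the
PDB library's code: `BlockMappedStream` and `ByteSlice` are hypothetical names,
`ByteSlice` stands in for `ArrayRef<uint8_t>`, a `bool` replaces `Error`, and a
`std::map` of vectors stands in for the `BumpPtrAllocator` cache.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <map>
#include <utility>
#include <vector>

// Hypothetical stand-in for ArrayRef<uint8_t>: a non-owning view of bytes.
struct ByteSlice {
  const uint8_t *Data;
  uint32_t Size;
};

// Presents a list of fixed-size blocks scattered through File as one
// contiguous logical stream. Blocks[i] is the physical block index that
// holds logical block i; the layout is assumed to be valid for File.
class BlockMappedStream {
public:
  BlockMappedStream(const std::vector<uint8_t> &File, uint32_t BlockSize,
                    std::vector<uint32_t> Blocks, uint32_t Length)
      : File(File), BlockSize(BlockSize), Blocks(std::move(Blocks)), Length(Length) {
    assert(BlockSize != 0 && Length <= this->Blocks.size() * BlockSize);
  }

  uint32_t getLength() const { return Length; }

  // Returns a view into File covering as many bytes as are physically
  // contiguous starting at Offset. Never copies.
  bool readLongestContiguousChunk(uint32_t Offset, ByteSlice &Out) const {
    if (Offset >= Length)
      return false;
    uint32_t First = Offset / BlockSize;
    uint32_t Last = First;
    while (Last + 1 < Blocks.size() && Blocks[Last + 1] == Blocks[Last] + 1)
      ++Last; // the next logical block is also the next physical block
    uint32_t End = std::min(Length, (Last + 1) * BlockSize);
    Out.Data = File.data() + Blocks[First] * BlockSize + Offset % BlockSize;
    Out.Size = End - Offset;
    return true;
  }

  // Zero-copy when the request fits in one contiguous chunk; otherwise the
  // pieces are stitched into a cached buffer that lives as long as the stream.
  bool readBytes(uint32_t Offset, uint32_t Size, ByteSlice &Out) const {
    if (Offset > Length || Size > Length - Offset)
      return false;
    if (Size == 0) {
      Out = {nullptr, 0};
      return true;
    }
    ByteSlice Chunk{nullptr, 0};
    readLongestContiguousChunk(Offset, Chunk);
    if (Chunk.Size >= Size) {
      Out = {Chunk.Data, Size};
      return true;
    }
    std::pair<uint32_t, uint32_t> Key(Offset, Size);
    auto It = Cache.find(Key);
    if (It == Cache.end()) {
      std::vector<uint8_t> Copy;
      Copy.reserve(Size);
      uint32_t Pos = Offset;
      while (Copy.size() < Size) {
        readLongestContiguousChunk(Pos, Chunk);
        uint32_t Remaining = Size - static_cast<uint32_t>(Copy.size());
        uint32_t N = std::min(Chunk.Size, Remaining);
        Copy.insert(Copy.end(), Chunk.Data, Chunk.Data + N);
        Pos += N;
      }
      It = Cache.emplace(Key, std::move(Copy)).first;
    }
    Out = {It->second.data(), Size};
    return true;
  }

private:
  const std::vector<uint8_t> &File;
  uint32_t BlockSize;
  std::vector<uint32_t> Blocks;
  uint32_t Length;
  // Keyed by (Offset, Size) for simplicity; map nodes keep buffers stable.
  mutable std::map<std::pair<uint32_t, uint32_t>, std::vector<uint8_t>> Cache;
};
```

With 4-byte blocks laid out as `{2, 0, 1}`, a read that stays inside blocks 0
and 1 points straight into the file, while a read that crosses from block 2
into block 0 is served from the cache, and repeating that read returns the same
cached bytes.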
>
> But all of these just allow reading and writing raw bytes.  To handle
> reading and writing semantic data, there are two additional classes:
> *ThinStreamReader* and *ThinStreamWriter*.
>
> These accept any subclass of `ThinStream` and maintain an offset and allow
> you to read integers in any endianness, strings, objects, arrays of objects
> (both fixed length records and variable length records), and various other
> things.
>
> Finally, there are *ThinStreamRef* and *WritableThinStreamRef* which you
> can think of as `ArrayRef` and `MutableArrayRef` for ThinStreams.  They
> allow slicing, dropping, etc and copy-semantics so that you can easily pass
> streams around without worrying about reference lifetime issues.
>
> To re-iterate: *When using a ThinStreamReader, copies are the exception,
> not the rule, and for most implementations of ThinStream they will never
> occur.*
> Suppose you have a struct:
>
> struct Header {
>   char Magic[48];
>   ulittle16_t A;
>   ulittle16_t B;
>   ulittle64_t C;
> };
>
> To read this using a `ThinStreamReader`, you would write this:
>
> ThinStreamReader Reader(Stream);
> const Header *H;
> if (auto EC = Reader.readObject(H))
>   return EC;
>
> and `ThinStreamReader` just reinterpret_casts the underlying bytes to your
> structure.  The same is true for null terminated strings, arrays of
> objects, and everything else.  *It is up to the user to ensure that reads
> and writes happen at proper alignments.*  (LLVM can still assert though
> if you do a misaligned read/write).  But when reading and writing records
> in binary file formats, this is not a new issue.
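
For illustration, the zero-copy `readObject` described above might look
something like this over a plain contiguous buffer. This is a sketch under
assumptions, not the ThinStreamReader implementation: `SketchReader`,
`ULittle16`, and `SmallHeader` are hypothetical names, `ULittle16` imitates an
unaligned little-endian integer type, and a `bool` replaces `Error`. It mirrors
the reinterpret_cast approach; strictly portable code would copy the bytes
out instead.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <type_traits>

// Hypothetical stand-in for an unaligned little-endian 16-bit integer.
struct ULittle16 {
  uint8_t Bytes[2];
  uint16_t value() const {
    return static_cast<uint16_t>(Bytes[0] | (Bytes[1] << 8));
  }
};

// A smaller cousin of the Header struct above; every member has alignment 1.
struct SmallHeader {
  char Magic[4];
  ULittle16 A;
  ULittle16 B;
};
static_assert(sizeof(SmallHeader) == 8, "layout must have no padding");

class SketchReader {
public:
  SketchReader(const uint8_t *Data, size_t Size) : Data(Data), Size(Size), Offset(0) {}

  size_t bytesRemaining() const { return Size - Offset; }

  // Points Dest at the bytes in the underlying buffer; nothing is copied.
  template <typename T> bool readObject(const T *&Dest) {
    static_assert(std::is_trivially_copyable<T>::value,
                  "only plain record types can be viewed in place");
    if (bytesRemaining() < sizeof(T))
      return false;
    const uint8_t *P = Data + Offset;
    // The caller is responsible for alignment; assert to catch mistakes.
    assert(reinterpret_cast<uintptr_t>(P) % alignof(T) == 0 && "misaligned read");
    Dest = reinterpret_cast<const T *>(P);
    Offset += sizeof(T);
    return true;
  }

private:
  const uint8_t *Data;
  size_t Size;
  size_t Offset;
};
```

The returned pointer is only valid while the underlying buffer is alive, which
is the price of not copying.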
>
> The proposal:
> This code has been used in LLVM's PDB library for some time, and I want to
> move it up to Support.
>
> I've discussed this with some people offline.  beanz@ has expressed
> interest for some work he wants to do on libobject.
>
>
> Out of curiosity, how do we expect this to be useful in libobject?
>
> Peter
>