[llvm-dev] RFC: Adding llvm::ThinStream

Zachary Turner via llvm-dev llvm-dev at lists.llvm.org
Sat Feb 18 17:09:05 PST 2017

Some background:

A while back while working on code to read / write PDB files, I came up
with Yet Another Stream Abstraction.  Note that LLVM already has a few.
Off the top of my head, theres:

1) `MemoryBuffer` and its associated class hierarchy
2) `raw_ostream` and it's associated classes.
3) `DataExtractor` which is used for reading from a StringRef.

There's probably more, and indivdiual subprojects might have even created
their own.

The reason I couldn't use any of these and needed to invent another is
because PDB files are not laid out contiguously in memory.  You can think
of it as a file system where there is an MFT that defines the blocks that
individual files live on, and then you have to re-sequence the blocks in
order to read out a "file".

To add to the complexity, each "file" inside of here is actually a list of
variable length records where records, or even individual fields of records
might cross a block boundary and require multiple reads to piece together.
I needed a way to view these streams as being contiguous, and reading data
out of them allowing all the magic of broken fields and records,
discontiguous records, etc to just disappear behind the interface.  Also, I
needed the ability to read and write using a single interface without
worrying about the details of the block structure.  None of LLVM's existing
abstractions provided convenient mechanisms for writing this kind of data.

So I came up with the following set of abstractions:

*ThinStream* - An abstract base class that provides read-only access to
data.  The interface is:
  virtual Error readBytes(uint32_t Offset, uint32_t Size, ArrayRef<uint8_t>
&Buffer) const = 0;
  virtual Error readLongestContiguousChunk(uint32_t Offset,
ArrayRef<uint8_t> &Buffer) const = 0;
  virtual uint32_t getLength() const = 0;

An important distinction between `ThinStream` and existing stream
implementations is that API encourages implementations to be *zero copy*.
Instead of giving it a buffer to write into, it just returns you a slice of
the existing buffer.  This makes it *very efficient*.  Similar to
`ArrayRef` / `MutableArrayRef`, I also provide *WritableThinStream* for
cases where your data is not read-only.  This is another area where
functionality is provided that was not present in existing abstractions
(i.e. writeability).

I have several implementations of this class and some abstractions for
working with them.  *ByteStream* and *MutableByteStream* provide a concrete
implementatino where the backing store is an `ArrayRef` or
`MutableArrayRef`.  *MappedBlockStream* (which is PDB specific) provides an
implementation that seeks around a file, piecing together blocks from
various locations in a PDB file.  When a call to `readBytes` spans a block
boundary, it does multiple reads, allocates a contiguous buffer from a
`BumpPtrAllocator` and returns a reference to that (subsequent requests for
the same offset return from the cached allocation).  There is also
*FileBufferThinStream* which adapts an `llvm::FileOutputBuffer` so you can
write to a file system object.  One could easily imagine an implementation
that adapts `llvm::MemoryBuffer` so that you could read and write from
mmap'ed files.

But all of these just allow reading and writing raw bytes.  To handle
reading and writing semantic data, there are two additional classes.
*ThinStreamReader* and *ThinStreamWriter.*

These accept any subclass of `ThinStream` and maintain an offset and allow
you to read integers in any endianness, strings, objects, arrays of objects
(both fixed length records and variable length records), and various other

Finally, there are *ThinStreamRef* and *WritableThinStreamRef* which you
can think of as `ArrayRef` and `MutableArrayRef` for ThinStreams.  They
allow slicing, dropping, etc and copy-semantics so that you can easily pass
streams around without worrying about reference lifetime issues.

To re-iterate: *When using a ThinStreamReader, copies are the exception,
not the rule and for most implementations of ThinStream will never occur.*
Suppose you have a struct:

struct Header {
  char Magic[48];
  ulittle16_t A;
  ulittle16_t B;
  ulittle64_t C;

To read this using a `ThinStreamReader`, you would write this:

ThinStreamReader Reader(Stream);
const Header *H;
if (auto EC = Reader.readObject(H))
  return EC;

and `ThinStreamReader` just reinterpret_casts the underlying bytes to your
structure.  The same is true for null terminated strings, arrays of
objects, and everything else.  *It is up to the user to ensure that reads
and writes happen at proper alignments.*  (LLVM can still assert though if
you do a misaligned read/write).  But when reading and writing records in
binary file formats, this is not a new issue.

The proposal:
This code has been used in LLVM's PDB library for some time, and I want to
move it up to Support.

I've discussed this with some people offline.  beanz@ has expressed
interest for some work he wants to do on libobject.  It can also replace a
few thousand lines of (untested) code in LLDB that does essentially the
same thing.  I suspect it can be used anytime anyone is reading from or
writing to a binary file format, perhaps even being faster than existing
implementations due to the zero-copy aspect.

I have a (somewhat large) patch locally that gets LLVM working with the
code raised up to Support.  If you want to look at the existing
implementation, the following files are relevant:

// Files that would move up to Support

// Files that would remain PDB specific

(In the existing implementation, Thin is not used in the class names)

The code is lacking doxygen style comments, but I would add complete
documentation as the result of any move.

Questions / comments welcome.
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.llvm.org/pipermail/llvm-dev/attachments/20170219/429cf119/attachment-0001.html>

More information about the llvm-dev mailing list