<div dir="ltr">Haven't got to this but would like to take a look/review it before it goes in.<br><br>*skimming over some of the description*<br><br>Sounds like 'stream' might not be the right terminology - since they return pointers into data that (I think) remains valid for the life of the stream? (this also makes me wonder a bit about memory usage if the cross-block operation is used a lot (causing allocations) but the values are then discarded by the user - the memory can't be reused, it's effectively leaked)<br><br>Also - the wrappers ThinStreamReader/Writer - do they benefit substantially from being thin-stream aware, rather than abstractions around any stream of bytes? (since they transform the data anyway)<br>Oh, reinterpret casting... hrm. That kind of file reading/writing scheme usually makes me a bit uncomfortable due to portability concerns (having to align, byte swap, etc, structs to match the on-disk format can make those structures problematic to work with - if you have to byte swap anyway, you'd need to copy the data out of the underlying buffer anyway, right?)</div><br><div class="gmail_quote"><div dir="ltr">On Wed, Feb 22, 2017 at 10:13 AM Zachary Turner via llvm-dev <<a href="mailto:llvm-dev@lists.llvm.org">llvm-dev@lists.llvm.org</a>> wrote:<br></div><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"><div dir="ltr" class="gmail_msg">I take this as no objections. 
I've changed the name from ThinStream to BinaryStream as it more accurately conveys what it is used for, and I've got the tests and comments mostly ready to go, so I'll commit this later today if there are no objections.</div><br class="gmail_msg"><div class="gmail_quote gmail_msg"><div dir="ltr" class="gmail_msg">On Sat, Feb 18, 2017 at 5:09 PM Zachary Turner &lt;<a href="mailto:zturner@google.com" class="gmail_msg" target="_blank">zturner@google.com</a>&gt; wrote:<br class="gmail_msg"></div><blockquote class="gmail_quote gmail_msg" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"><div dir="ltr" class="gmail_msg"><div class="gmail_msg">Some background:</div><div class="gmail_msg"><br class="gmail_msg"></div>A while back, while working on code to read / write PDB files, I came up with Yet Another Stream Abstraction. Note that LLVM already has a few. Off the top of my head, there are:<div class="gmail_msg"><br class="gmail_msg"></div><div class="gmail_msg">1) `MemoryBuffer` and its associated class hierarchy</div><div class="gmail_msg">2) `raw_ostream` and its associated classes.</div><div class="gmail_msg">3) `DataExtractor` which is used for reading from a StringRef.</div><div class="gmail_msg"><br class="gmail_msg"></div><div class="gmail_msg">There are probably more, and individual subprojects might even have created their own.</div><div class="gmail_msg"><br class="gmail_msg"></div><div class="gmail_msg">The reason I couldn't use any of these and needed to invent another is that PDB files are not laid out contiguously in memory. 
You can think of it as a file system where there is an MFT that defines the blocks that individual files live on, and then you have to re-sequence the blocks in order to read out a "file".</div><div class="gmail_msg"><br class="gmail_msg"></div><div class="gmail_msg">To add to the complexity, each "file" in here is actually a list of variable-length records, where records, or even individual fields of records, might cross a block boundary and require multiple reads to piece together. I needed a way to view these streams as contiguous, so that when reading data out of them all the magic of broken fields and records, discontiguous records, etc. just disappears behind the interface. Also, I needed the ability to read and write using a single interface without worrying about the details of the block structure. None of LLVM's existing abstractions provided convenient mechanisms for writing this kind of data.</div><div class="gmail_msg"><br class="gmail_msg"></div><div class="gmail_msg">So I came up with the following set of abstractions:</div><div class="gmail_msg"><br class="gmail_msg"></div><div class="gmail_msg"><b class="gmail_msg">ThinStream</b> - An abstract base class that provides read-only access to data. The interface is:</div><div class="gmail_msg"> virtual Error readBytes(uint32_t Offset, uint32_t Size, ArrayRef&lt;uint8_t&gt; &amp;Buffer) const = 0;</div><div class="gmail_msg"><div class="gmail_msg"> virtual Error readLongestContiguousChunk(uint32_t Offset, ArrayRef&lt;uint8_t&gt; &amp;Buffer) const = 0;</div><div class="gmail_msg"> virtual uint32_t getLength() const = 0;<br class="gmail_msg"></div></div><div class="gmail_msg"><br class="gmail_msg"></div><div class="gmail_msg">An important distinction between `ThinStream` and existing stream implementations is that the API encourages implementations to be <b class="gmail_msg">zero copy</b>. Instead of taking a buffer to write into, it just returns you a slice of the existing buffer. 
This makes it <b class="gmail_msg">very efficient</b>. Similar to `ArrayRef` / `MutableArrayRef`, I also provide <b class="gmail_msg">WritableThinStream</b> for cases where your data is not read-only. This is another area where it provides functionality that was not present in existing abstractions (i.e. writability).</div><div class="gmail_msg"><b class="gmail_msg"><br class="gmail_msg"></b></div><div class="gmail_msg">I have several implementations of this class and some abstractions for working with them. <b class="gmail_msg">ByteStream</b> and <b class="gmail_msg">MutableByteStream</b> provide a concrete implementation where the backing store is an `ArrayRef` or `MutableArrayRef`. <b class="gmail_msg">MappedBlockStream</b> (which is PDB specific) provides an implementation that seeks around a file, piecing together blocks from various locations in a PDB file. When a call to `readBytes` spans a block boundary, it does multiple reads, allocates a contiguous buffer from a `BumpPtrAllocator`, and returns a reference to that (subsequent requests for the same offset return from the cached allocation). There is also <b class="gmail_msg">FileBufferThinStream</b> which adapts an `llvm::FileOutputBuffer` so you can write to a file system object. One could easily imagine an implementation that adapts `llvm::MemoryBuffer` so that you could read from and write to mmap'ed files.</div><div class="gmail_msg"><br class="gmail_msg"></div><div class="gmail_msg">But all of these just allow reading and writing raw bytes. To handle reading and writing semantic data, there are two additional classes: 
<b class="gmail_msg">ThinStreamReader</b> and <b class="gmail_msg">ThinStreamWriter.</b></div><div class="gmail_msg"><b class="gmail_msg"><br class="gmail_msg"></b></div><div class="gmail_msg">These accept any subclass of `ThinStream`, maintain an offset, and allow you to read integers in any endianness, strings, objects, arrays of objects (both fixed-length records and variable-length records), and various other things.</div><div class="gmail_msg"><br class="gmail_msg"></div><div class="gmail_msg">Finally, there are <b class="gmail_msg">ThinStreamRef</b> and <b class="gmail_msg">WritableThinStreamRef</b> which you can think of as `ArrayRef` and `MutableArrayRef` for ThinStreams. They allow slicing, dropping, etc., and have copy semantics so that you can easily pass streams around without worrying about reference lifetime issues.</div><div class="gmail_msg"><br class="gmail_msg"></div><div class="gmail_msg">To reiterate: <b class="gmail_msg">When using a ThinStreamReader, copies are the exception, not the rule, and for most implementations of ThinStream they will never occur.</b> Suppose you have a struct:</div><div class="gmail_msg"><br class="gmail_msg"></div><div class="gmail_msg">struct Header {</div><div class="gmail_msg"> char Magic[48];</div><div class="gmail_msg"> ulittle16_t A;</div><div class="gmail_msg"> ulittle16_t B; <br class="gmail_msg"></div><div class="gmail_msg"> ulittle64_t C;</div><div class="gmail_msg">};<br class="gmail_msg"></div><div class="gmail_msg"><br class="gmail_msg"></div><div class="gmail_msg">To read this using a `ThinStreamReader`, you would write this:</div><div class="gmail_msg"><br class="gmail_msg"></div><div class="gmail_msg">ThinStreamReader Reader(Stream);</div><div class="gmail_msg">const Header *H;</div><div class="gmail_msg">if (auto EC = Reader.readObject(H))</div><div class="gmail_msg"> return EC;</div><div class="gmail_msg"><br class="gmail_msg"></div><div class="gmail_msg">and `ThinStreamReader` just reinterpret_casts the 
underlying bytes to your structure. The same is true for null-terminated strings, arrays of objects, and everything else. <b class="gmail_msg">It is up to the user to ensure that reads and writes happen at proper alignments.</b> (LLVM can still assert, though, if you do a misaligned read/write.) But when reading and writing records in binary file formats, this is not a new issue.</div><div class="gmail_msg"><br class="gmail_msg"></div><div class="gmail_msg">The proposal:</div><div class="gmail_msg">This code has been used in LLVM's PDB library for some time, and I want to move it up to Support. </div><div class="gmail_msg"><br class="gmail_msg"></div><div class="gmail_msg">I've discussed this with some people offline. beanz@ has expressed interest in it for some work he wants to do on libobject. It can also replace a few thousand lines of (untested) code in LLDB that does essentially the same thing. I suspect it can be used any time anyone is reading from or writing to a binary file format, perhaps even being faster than existing implementations due to the zero-copy aspect.</div><div class="gmail_msg"><br class="gmail_msg"></div><div class="gmail_msg">I have a (somewhat large) patch locally that gets LLVM working with the code raised up to Support. 
If you want to look at the existing implementation, the following files are relevant:</div><div class="gmail_msg"><br class="gmail_msg"></div><div class="gmail_msg">// Files that would move up to Support</div><div class="gmail_msg">include/DebugInfo/MSF/ByteStream.h <br class="gmail_msg"></div><div class="gmail_msg">include/DebugInfo/MSF/StreamInterface.h</div><div class="gmail_msg">include/DebugInfo/MSF/StreamReader.h</div><div class="gmail_msg">include/DebugInfo/MSF/StreamWriter.h<br class="gmail_msg"></div><div class="gmail_msg">include/DebugInfo/MSF/StreamArray.h<br class="gmail_msg"></div><div class="gmail_msg">include/DebugInfo/MSF/StreamRef.h <br class="gmail_msg"></div><div class="gmail_msg">lib/DebugInfo/MSF/StreamReader.cpp</div><div class="gmail_msg">lib/DebugInfo/MSF/StreamWriter.cpp<br class="gmail_msg"></div><div class="gmail_msg">lib/DebugInfo/MSF/StreamRef.cpp<br class="gmail_msg"></div><div class="gmail_msg"><br class="gmail_msg"></div><div class="gmail_msg">// Files that would remain PDB specific</div><div class="gmail_msg">include/DebugInfo/MSF/MappedBlockStream.h<br class="gmail_msg"></div><div class="gmail_msg">lib/DebugInfo/MSF/MappedBlockStream.cpp<br class="gmail_msg"></div><div class="gmail_msg"><br class="gmail_msg"></div><div class="gmail_msg">(In the existing implementation, Thin is not used in the class names)</div><div class="gmail_msg"><br class="gmail_msg"></div><div class="gmail_msg">The code is lacking doxygen style comments, but I would add complete documentation as the result of any move.</div><div class="gmail_msg"><br class="gmail_msg"></div><div class="gmail_msg">Questions / comments welcome.</div></div></blockquote></div>
_______________________________________________<br class="gmail_msg">
LLVM Developers mailing list<br class="gmail_msg">
<a href="mailto:llvm-dev@lists.llvm.org" class="gmail_msg" target="_blank">llvm-dev@lists.llvm.org</a><br class="gmail_msg">
<a href="http://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev" rel="noreferrer" class="gmail_msg" target="_blank">http://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev</a><br class="gmail_msg">
</blockquote></div>
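<br><div dir="ltr">For readers following the thread, the zero-copy `readObject` pattern described in the quoted proposal can be sketched roughly as below. This is only an illustration in standalone C++, not the code in LLVM: `ByteSlice`, `ContiguousStream`, `SketchReader`, and the 6-byte `Header` layout are hypothetical stand-ins (for `ArrayRef`, `ByteStream`, `ThinStreamReader`, and a real on-disk header), and failures are reported with a plain `bool` rather than `llvm::Error`.</div>

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

using std::uint32_t;
using std::uint8_t;

// Hypothetical stand-in for llvm::ArrayRef<uint8_t>: a non-owning view.
struct ByteSlice {
  const uint8_t *Data;
  uint32_t Size;
};

// Hypothetical stream over one contiguous buffer. A read hands back a
// slice of the backing store instead of copying into a caller buffer.
class ContiguousStream {
public:
  ContiguousStream(const uint8_t *Data, uint32_t Size)
      : Data(Data), Size(Size) {}

  // Returns false if [Offset, Offset + Len) is out of bounds.
  bool readBytes(uint32_t Offset, uint32_t Len, ByteSlice &Out) const {
    if (Offset > Size || Len > Size - Offset)
      return false;
    Out.Data = Data + Offset;
    Out.Size = Len;
    return true;
  }

  uint32_t getLength() const { return Size; }

private:
  const uint8_t *Data;
  uint32_t Size;
};

// Hypothetical reader: tracks an offset and reinterprets bytes in place.
class SketchReader {
public:
  explicit SketchReader(const ContiguousStream &Stream) : Stream(Stream) {}

  template <typename T> bool readObject(const T *&Out) {
    ByteSlice Bytes;
    if (!Stream.readBytes(Offset, static_cast<uint32_t>(sizeof(T)), Bytes))
      return false;
    // The caller is responsible for alignment; reject misaligned reads.
    if (reinterpret_cast<std::uintptr_t>(Bytes.Data) % alignof(T) != 0)
      return false;
    // Mirrors the cast described in the proposal. Assumes T is trivially
    // copyable and laid out exactly like the on-disk bytes.
    Out = reinterpret_cast<const T *>(Bytes.Data);
    Offset += static_cast<uint32_t>(sizeof(T));
    return true;
  }

  uint32_t getOffset() const { return Offset; }

private:
  const ContiguousStream &Stream;
  uint32_t Offset = 0;
};

// Hypothetical header made only of byte-sized fields, so it has
// alignment 1 and no byte-order concerns.
struct Header {
  char Magic[4];
  uint8_t Version;
  uint8_t Flags;
};
static_assert(sizeof(Header) == 6 && alignof(Header) == 1,
              "sketch assumes a packed, byte-aligned header");
```

<div dir="ltr">In this sketch the pointer returned by `readObject` aliases the stream's backing buffer, so it is only valid while that buffer is alive. A header with multi-byte integer fields would additionally need something like the `ulittle16_t` wrappers from the proposal to handle byte order and alignment, which is the portability concern raised at the top of this reply.</div>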