[PATCH] D111414: [Demangle] Add minimal support for D programming language

Fri Oct 15 08:18:39 PDT 2021

ljmf00 added inline comments.

================
Comment at: llvm/lib/Demangle/DLangDemangle.cpp:36
+/// \note Beware these aren't required to be '\0' terminated
+struct OutputString {
+
----------------
dblaikie wrote:
> jhenderson wrote:
> > ljmf00 wrote:
> > > ljmf00 wrote:
> > > > dblaikie wrote:
> > > > > dblaikie wrote:
> > > > > > ljmf00 wrote:
> > > > > > > jhenderson wrote:
> > > > > > > > ljmf00 wrote:
> > > > > > > > > dblaikie wrote:
> > > > > > > > > > jhenderson wrote:
> > > > > > > > > > > dblaikie wrote:
> > > > > > > > > > > > ljmf00 wrote:
> > > > > > > > > > > > > dblaikie wrote:
> > > > > > > > > > > > > > Any chance of using an existing stream type (like `llvm::itanium_demangle::OutputStream` which `RustDemangle` also uses?)? Otherwise might be worth a bunch of separate testing of this class - or incrementally adding functionality to it that functionality is used in the patch series, otherwise it's hard to tell that everything's tested if it's added in one go here & significant parts are currently unused.
> > > > > > > > > > > > > This type differs from a stream because D demangler needs the ability to prepend to the output string due to how D demangling is designed. Because of that, a stream is not a good fit here and plus adding methods like prepend will make it conceptually not a stream at all.
> > > > > > > > > > > > > 
> > > > > > > > > > > > > I will incrementally add parts of it when necessary and add tests for it. Although, if you find any other data structure that might be suitable here, please let me know.
> > > > > > > > > > > > Ah, fair enough. Might want to check the naming conventions (I'd expect "need" to be called "reserve" in C++ API parlance, though perhaps in LLVM's APIs that's called "grow"?)/API design (do the other similar data structures have a "free()" function, or do they rely on the dtor to cleanup?) line up with the existing OutputStream/other data structures in LLVM for consistency.
> > > > > > > > > > > > 
> > > > > > > > > > > > & maybe worth pulling it out into a separate header - given how big this whole file is likely to get?
> > > > > > > > > > > I wonder if we could model this as some kind of `deque<std::string>` - basically buffer things by adding strings to the start or end of the queue (prepend and append), before finalizing it into a single string at a later point. It might help avoid a lot of manual memory management/copying etc.
> > > > > > > > > > Probably would be a `deque<char>`? Though that'd then involve copying the result into the output buffer - so I think the current approach is probably a/the good one, and consistent with the other demanglers that write directly into their output buffer. Just with the added constraint of needing to be able to prepend info.
> > > > > > > > > > 
> > > > > > > > > > (another alternative would be to generalize the existing itanium OutputStream to support prepending too)
> > > > > > > > > I think a deque is not a good fit for this for several reasons. I think you were referring to some sort of `deque<char*>`, otherwise it would be very memory inefficient -- the deque node structs are way heavier than a byte.
> > > > > > > > > 
> > > > > > > > > Even with a string deque memory not being contiguous across nodes can also hurt performance, when, e.g. calculating the length of a given deque. Another problem with that is the fact that memory allocation is done in chunks and gown twice when needed.
> > > > > > > > > 
> > > > > > > > > Also, assuming that a string is appended/prepended to the deque on every OutputString::append or OutputString::prepend, allocating small chunks individually is also incredibly slow -- if that is not the case I'm not seeing a way where it is applicable at all.
> > > > > > > > > 
> > > > > > > > > Talking about the OutputStream, as I mentioned above, prepending is not part of a Stream concept (correct me if I'm wrong), since a stream is designed to only grow at the end. I'm not quite sure if something is relying on specific characteristics of a stream but, e.g. because a stream only grows at the end, adding a prepend would mess up with indices. If that is not a problem, I would consider that option but also renaming it.
> > > > > > > > Yeah, char* or string is what I was thinking, since if you don't need to go back and modify any of the already added string, you can just naturally add substrings at either end. You could also try a simple `std::list`, since that would avoid the overhead of reserving space completely. I've not really studied the performance implications of the various std containers though, so I can't really advise what's the most appropriate overall. I'm more interested in the practicality of writing and maintaining the code (don't reinvent the wheel and all that, which I feel this class more-or-less is doing).
> > > > > > > > 
> > > > > > > > I'm not sure when you need to calculate the overall length, so I'm not sure that aspect is relevant?
> > > > > > > I thought wrongly about deque. Probably deque<char> is desirable compared to deque<char*> or deque<string> since it has the same behaviour of std::vector but with amortized constant insertions at the beginning or end of the nodes. I still don't see good fit because of what I will describe bellow:
> > > > > > > 
> > > > > > > About the length, `getLength()` is being used on some parts of the code. The length on a deque is O(N), being N the number of nodes, so it is certainly less performant, unless I rely on a manual counter. The same applies with std::list, with the risk of having even more nodes on std::list, since it is internally implemented with a double linked list instead of individually allocated arrays. The only benefit std::list brings is the constant insertion in the middle. Also copying continuous memory in chunks can be way faster with optimized SIMD `memcpy`.
> > > > > > > 
> > > > > > > I read the code of `OutputStream` and it seems reasonable to me to use it, as it has a very similar approach to this implementation. The only major thing I really see is prepending. I can add the prepend method but my question remains the same, is it really a stream or should we consider renaming it? Is there any implication renaming it, if that is the case? If that is not possible, I thought about extending OutputString from OutputStream making that possible.
> > > > > > If we were going to move away from a custom data structure, to something that'll require copying at the end of the demangling (to get it into a standalone malloc'd buffer to return from the API) I'd probably suggest just using `std::string` - it's got the same general performance characteristics (eg: O(N) on prepend) as the current code, and is simple/direct, doesn't introduce potential other issues like memory fragmentation, etc.
> > > > >  Re: OutputStream: Yeah, I think it's probably fine to rename it (`OutputBuffer`, `StringBuffer`, `OutputStringBuffer`? It's already in a (perhaps overly specific (itanium, even though it's already used for Rust and now potentially for D) namespace, so doesn't have to include a "demangled" or something else to disambiguate it) - I'm OK with any of the suggested names, including `OutputString`. I guess `OutputBuffer` is closer to the original name, while removing the "only adds to the end" semantics carried by the word "stream".
> > > > > 
> > > > > Probably rename it in one patch, add the needed prepend in whatever patch is using it shortly thereafter.
> > > > The problem I mentioned earlier about `std::string`, and addressed by `OutputStream` and this `OutputString` implementation, is the fact that the `grow()`/`need()` expands twice the current size + requested size as opposed to the `reserve()` method in `std::string` (See here for more context: https://github.com/llvm/llvm-project/blob/main/libcxx/include/string#L3323). This is important because appending to `OutputStream`/`OutputString` is more efficient with significant less allocations.
> > > Ok, I'm going to write a patch to add prepending and rename it accordingly to `OutputBuffer`. We can discuss further the implementation/approach there.
> > As a thought before getting too far with the reimplmentation of OutputStream, I wonder if you could model the following function:
> > ```
> > void OutputStream::prepend(const std::string &Input);
> > ```
> > as:
> > ```
> > OutputStream prependToStream(const std::string &Input, OutputStream &&stream);
> > ```
> > In other words, create a new stream that just has the input contents, and then move the contents of the old stream (including any memory allocation etc) into the newly created stream somehow.
> > 
> > I've not thought about performance implications too much, so this may be a terrible idea due to additional copying involved somewhere along the lines.
> I think this could probably be done efficiently/without any overhead - probably the missing piece would be the ability to pass in a malloc'd buffer to OutputStream's ctor. Then you could take it from one OutputStream, do whatever you want with it (prepend, etc) then construct a new OutputStream with it.
> 
> But I'd lean towards considering that not better than adding the functionality directly to OutputStream (possibly with the rename).
Yeah, there is nothing less efficient on that approach, although I think it could lead to confusion. If it is not appropriate to have it inside, having it outside is questionable for other people that might use it. This can be solved by documenting the behaviour properly, but I'm also inclined to renaming.

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D111414/new/

https://reviews.llvm.org/D111414