[PATCH] D41102: Setup clang-doc frontend framework

Thu Feb 1 00:40:29 PST 2018

sammccall added a comment.

In https://reviews.llvm.org/D41102#994311, @juliehockett wrote:

> I'm going to take a stab at refactoring the serialization part of this next -- rather than keeping it all in memory and dumping it at the end, it should serialize as it goes and do some sort of reduce at the end. I'm currently thinking of writing it out to a file as it goes, and then reading and reducing from there -- thoughts?

Sounds awesome :-) Some thoughts on structure (sorry for the rant - clangd's indexer has a similar problem so I've been thinking about this).

This neatly separates the problem out into

- "business logic" for map and reduce - these are basically pure functions
- imperative frameworky stuff: serializing/deserializing, grouping map keys together, running threadpools

Sadly we don't have a generic implementation of the frameworky part in libtooling I think, so you'll probably end up writing the bits you need in clang-doc.
There is `ExecutionContext` in Tooling/Execution.h, which allows you to report KV pairs, and loop over them afterwards, which models map output (must be serialized as strings).

Still I think there's value in separating out the two layers:

- conceptual clarity
- testing (the pure functions are nice to test)
- one can hook up the mapper/reducer to hadoop, or google's internal infrastructure, or ...
- it can inspire and benefit from someone writing a reusable upstream MR API

So maybe the API could look something roughly like:

`newFrontendActionFactory(DocSink)` returns clang-doc's extractor part (the mapper, i.e. most of this patch)
`DocSink` is a class with a bunch of methods like `emitNamespaceInfo(StringRef Name, NSInfo)`. This adapts to the MR machinery: it could write to ExecutionContext, or files on disk as you suggest.
Merging provided via classes like `class MergeNS { MergeNS(StringRef Name); void add(NSInfo); NSInfo finish(); }` These would be invoked by the MR machinery (for now, just code that iterates over the ExecutionContext or files on disk).

In this case, you could test the mapper in isolation by writing a small tool that just runs the mapper over some files, gathers the result, and dumps it as YAML. The reducer and the MR driver wouldn't need to be part of this patch.

Anyway, of course you may want to take this in a different direction - just some ideas.

================
Comment at: tools/clang-doc/ClangDoc.h:29
+struct ClangDocContext {
+  // Which format in which to emit representation.
+  OutFormat EmitFormat;
----------------
juliehockett wrote:
> sammccall wrote:
> > Is this the intermediate representation referred to in the design doc, or the final output format?
> > 
> > If the former, why two formats rather than picking one?
> > YAML is nice for being usable by out-of-tree tools (though not as nice as JSON). But it seems like providing YAML as a trivial backend format would fit well?
> > Bitcode is presumably more space-efficient - if this is significant in practice it seems like a better choice.
> That's the idea -- for developing purposes, I wrote up the YAML output first for this patch, and there will be a follow-on patch expanding the bitcode/binary output. I've updated the flags to default to the binary, with an option to dump the yaml (rather than the other way around).
What's still not clear to me is: is YAML a) a "real" intermediate format, or b) just a debug representation?

I would suggest for orthogonality that there only be one intermediate format, and that any debug version be generated from it. In practice I guess this means:
 - the reporter builds the in-memory representation
 - you can serialize/deserialize memory representation to the IR (bitcode)
 - you can serialize memory representation to debug representation (YAML) but not parse
 - maybe the clang-doc core should *only* know about IR, and YAML should be produced in the same way e.g. HTML would be?

This does pose a short-term problem: the canonical IR is bitcode, we need YAML for the lit tests, and we don't have the decoder/transformer part yet. This could be solved either by using YAML as the IR *for now* and switching later, or by adding a simple decoder now.
Either way it points to the *reporter* not having an output format option, and having to support two formats.

WDYT? I might be missing something here.

================
Comment at: tools/clang-doc/ClangDoc.h:33
+
+class ClangDocVisitor : public RecursiveASTVisitor<ClangDocVisitor> {
+public:
----------------
juliehockett wrote:
> jakehehrlich wrote:
> > sammccall wrote:
> > > This API makes essentially everything public. Is that the intent?
> > > 
> > > It seems like `ClangDocVisitor` is a detail, and the operation you want to expose is "extract doc from this AST into this reporter" or maybe "create an AST consumer that feeds this reporter".
> > > 
> > > It would be useful to have an API to extract documentation from individual AST nodes (e.g. a Decl). But I'd be nervous about trying to use the classes exposed here to do that. If it's efficiently possible, it'd be nice to expose a function.
> > > (one use case for this is clangd)
> > Correct me if I'm wrong but I believe that everything needs to be public in this case because the base class needs to be able to call them. So the visit methods all need to be public.
> Yes to the `Visit*Decl` methods being public because of the base class.
> 
> That said, I shifted a few things around here and implemented it as a `MatcherFinder` instead of a `RecursiveASTVisitor`. The change will allow us to make most of the methods private, and have the ability to fairly easily implement an API for pulling a specific node (e.g. by name or by decl type). As far as I understand (and please correct me if I'm wrong), the matcher traverses the tree in a similar way. This will also make mapping through individual nodes easier.
Sorry for being vague - yes overridden or "CRTP-overridden" methods may need to be public.
I meant that the classes themselves don't need to be exposed, I think. (The header could just expose a function to create the needed ones, that returns `unique_ptr<interface>`

There are now fewer classes exposed here, but I think most/all of them can still reasonably be hidden.

https://reviews.llvm.org/D41102