[PATCH] D43341: [clang-doc] Implement reducer portion of the frontend framework

Sam McCall via Phabricator via cfe-commits cfe-commits at lists.llvm.org
Tue Apr 17 02:57:37 PDT 2018


sammccall added a comment.

OK, I didn't get through all the detail, but we should talk about the MR stuff today :)



================
Comment at: clang-doc/BitcodeReader.cpp:25
+void ClangDocBitcodeReader::storeData(SymbolID &Field, llvm::StringRef Blob) {
+  assert(Record[0] == 20);
+  // First position in the record is the length of the following array, so we
----------------
why is this true?
i.e. why can't I produce a malformed bitcode file that causes this assertion to fail?
(My usual understanding is that a failed assert indicates a programming error; if it means something else here, like "malformed file", you should probably document that, assuming it's what you want.)
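If the intent is to reject malformed input, something along these lines (a sketch only; the bool return and the exact check are my assumptions, not what the patch does) would be more robust than asserting:

  // Sketch: treat an unexpected length as malformed input, not a programmer
  // error, and report it to the caller (which would need a bool/llvm::Error
  // return rather than void).
  if (Record.empty() || Record[0] != 20)
    return false;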


================
Comment at: clang-doc/BitcodeReader.cpp:42
+                                      llvm::StringRef Blob) {
+  Field = (AccessSpecifier)Record[0];
+}
----------------
Similarly, what guarantees Record[0] is in-range for AccessSpecifier? casting an out-of-range integer to an enum results in UB.
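For example (sketch; assumes AS_none is still the last enumerator of clang::AccessSpecifier):

  // Validate before casting: converting an out-of-range value to an enum is UB.
  if (Record[0] > AS_none)
    return false; // malformed input
  Field = static_cast<AccessSpecifier>(Record[0]);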


================
Comment at: clang-doc/BitcodeReader.h:11
+// This file implements a reader for parsing the clang-doc internal
+// representation to LLVM bitcode. The reader takes in a stream of bits and
+// generates the set of infos that it represents.
----------------
nit: *from* LLVM bitcode, or am I missing something?


================
Comment at: clang-doc/BitcodeReader.h:29
+// Class to read bitstream into an InfoSet collection
+class ClangDocBitcodeReader {
+public:
----------------
I found the implementation here a bit hard to understand at first, mostly because the state/invariants/responsibilities of each method weren't obvious to me.
Left some suggestions below to document these a bit more.

However, many of these seem in practice to share only the Record buffer. Making it an explicit parameter would let you turn them into free functions, hiding them from the header and avoiding any possibility of spooky side-channel interactions. Up to you of course.
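Roughly what I have in mind (a sketch only; the exact types and names are my assumptions):

  // In BitcodeReader.cpp only, not in the header:
  using Record = llvm::SmallVector<uint64_t, 1024>;

  static bool parseRecord(const Record &R, unsigned ID, llvm::StringRef Blob,
                          NamespaceInfo &I);
  static bool parseRecord(const Record &R, unsigned ID, llvm::StringRef Blob,
                          FunctionInfo &I);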


================
Comment at: clang-doc/BitcodeReader.h:46
+
+  // Reading records
+  template <typename TInfo> bool readRecord(unsigned ID, TInfo &I);
----------------
I think this needs some elaboration :-)
e.g. Populates Record with the decoded record, and then calls one of the `parseRecord` overloads.

In general the data flow between these methods seems a little obscured by having Record be a field rather than a parameter as ID/Blob are.
(Incidentally I don't think you gain anything here from the use of SmallVector over std::vector, which is easier to type, but it doesn't really matter)


================
Comment at: clang-doc/BitcodeReader.h:49
+
+  bool parseRecord(unsigned ID, llvm::StringRef Blob, unsigned VersionNo);
+  bool parseRecord(unsigned ID, llvm::StringRef Blob, NamespaceInfo &I);
----------------
Please add a comment for these methods, e.g. "Populates the appropriate field of the Info object based on a decoded record whose content is in Record."


================
Comment at: clang-doc/BitcodeReader.h:59
+
+  void storeData(llvm::SmallVectorImpl<char> &Field, llvm::StringRef Blob);
+  void storeData(bool &Field, llvm::StringRef Blob);
----------------
Comment for these like "decodes the blob part of a record" - or just rename these `decodeBlob` or similar?


================
Comment at: clang-doc/Reducer.cpp:17
+
+std::unique_ptr<InfoSet> mergeInfos(tooling::ToolResults *Results) {
+  std::unique_ptr<InfoSet> UniqueInfos = llvm::make_unique<InfoSet>();
----------------
A little bit of context to document here: the fact that ToolResults values are bit-encoded InfoSets describing each TU in the codebase. (I think?)

(Also, why not return InfoSet by value?)


================
Comment at: clang-doc/Reducer.cpp:23
+    ClangDocBitcodeReader Reader(Stream);
+    if (!Reader.readBitstream(*UniqueInfos)) {
+      Err = true;
----------------
I don't see where any actual merging is happening.
Say we process TU 1 which has a decl of foo() with a doc comment, and TU 2 which defines foo(). Now we're going to have something like:
ToolResults = {
 "tu1": "...bitcode...foo().../*does foo*/",
 "tu2": "...bitcode...foo()...Foo.cpp:53",
}
The first readBitstream call is ultimately going to call storeData(UniqueInfos->Functions), which appends a new function to the array; then the second readBitstream call will do the same thing, and you end up with an InfoSet containing two copies of foo().

what am I missing?
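For contrast, here's the kind of merge I was expecting to see somewhere (a sketch; FunctionIndex and merge() are hypothetical, not in the patch):

  void InfoSet::insert(FunctionInfo &&F) {
    auto It = FunctionIndex.find(F.USR); // hypothetical USR -> index map
    if (It == FunctionIndex.end()) {
      FunctionIndex[F.USR] = Functions.size();
      Functions.push_back(std::move(F)); // first occurrence of this symbol
    } else {
      // Combine e.g. the decl's comment with the definition's location.
      Functions[It->second].merge(std::move(F));
    }
  }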


================
Comment at: clang-doc/Reducer.h:21
+// Combine occurrences of the same info across translation units.
+std::unique_ptr<InfoSet> mergeInfos(tooling::ToolResults *Results);
+
----------------
This doesn't quite look like the typical reduce pattern. (And there's no framework or guidance for MRs in libtooling, so this isn't surprising, or maybe even a problem.)

When the mapper emits its values, it usually chooses the keys so that a non-trivial merge is only required among values with the same key. (e.g. USR here)
This does two things:
1. it makes it really easy to parallelize: after the mappers run in parallel, the data is grouped by key and all the reducers can run in parallel
2. it simplifies the reducer code a lot, as it only has to deal with a single entity (e.g. function) at a time.

So conceptually instead of being a function(vector<InfoSet>) --> InfoSet, it's a function(USR, vector<FunctionInfo>) --> FunctionInfo.
The signature you've got here doesn't actually preclude 1), but it's unclear whether you expect it to be used this way. (i.e. do all the InfoSets being passed here actually contain only a single, identical entity?).


That said - does this actually matter? If you want to scale to large codebases, loading all your Infos into one InfoSet in memory is something to avoid. It's important to be able to process the symbols separately, both for parallelism and to avoid atomic tasks that are too big for one machine.
If you want to do this, then modelling the data unit as "one symbol of some kind" rather than "an arbitrary set of symbols" is much more natural. You no longer need "what is this symbol" in the value, nor to decode it over and over; instead you put it in the key.
So conceptually you have something like:
key="FUNCTION/::foo(int)", value = "COMMENT=does foo;PARAMS=[length]"
Your reducer merges the values for the same key without having to touch the key, and then the final documentation generator can run over the values independently.


If you don't care about big codebases, I can't think of a strong reason to care. I like the conceptual framework of MR, but it's not necessary.
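To make the shape concrete, the reduce step I'm describing is roughly this (hypothetical signature, not something libtooling provides today):

  // The framework groups mapper output by USR, so each reducer call sees
  // only the occurrences of a single symbol and merges them.
  FunctionInfo reduceFunction(llvm::StringRef USR,
                              llvm::ArrayRef<FunctionInfo> Occurrences);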


================
Comment at: clang-doc/Representation.h:135
   Info() = default;
-  Info(Info &&Other) : Description(std::move(Other.Description)) {}
-  virtual ~Info() = default;
+  Info(Info &&Other)
+      : USR(Other.USR), Name(Other.Name), Namespace(std::move(Other.Namespace)),
----------------
juliehockett wrote:
> lebedev.ri wrote:
> > Why is the move constructor explicitly defined?
> > Unless i'm missing something, the default one would do exactly the same.
> > https://godbolt.org/g/XqsRuX
> It's there (as is the CommentInfo one) because otherwise it was throwing errors. It appears that a copy constructor is called somewhere if this isn't specified.
you can define the move constructor as `= default`, and copy as `= delete`. The intent is clear, but you save some typing.
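i.e. something like:

  Info(Info &&Other) = default;
  Info(const Info &Other) = delete;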


================
Comment at: clang-doc/Representation.h:207
 
+class InfoSet {
+public:
----------------
This class is central, has a vague name, and is undocumented :-)

I gather this is the collection that describes the whole codebase. Something like CodebaseInfo might be a little more descriptive (though also misleading, as it doesn't inherit Info). But the name is OK as long as it's documented.
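Even a one-liner would help, e.g. (my guess at the intent, please correct):

  /// The merged set of Infos (namespaces, records, functions, enums) for the
  /// whole codebase, combined across translation units.
  class InfoSet {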


================
Comment at: clang-doc/Representation.h:220
+  /// Populate the type references in all infos
+  void createReferences();
+
----------------
So this implementation relies on InfoIndex and having the whole codebase in one InfoSet in memory.
If you want to scale past having everything in-memory, this needs to be part of the MR too.

Since you're at EuroLLVM, let's find a chance to chat about it today; we should make a decision about whether "all fits in memory" is a simplifying assumption you want.


https://reviews.llvm.org/D43341




