[PATCH] D83814: [clangd] Add Random Forest runtime for code completion.

Wed Sep 16 07:52:37 PDT 2020

sammccall added a comment.

Just some naming and doc nits. This looks really solid now, nice job!

In D83814#2261458 <https://reviews.llvm.org/D83814#2261458>, @jkorous wrote:

> Hi @usaxena95 and @sammccall,
>
> I am wondering about couple high-level things.
>
> Do you guys intend to open-source also the training part of the model pipeline or publish a model trained on generic-enough training set so it could be reasonably used on "any" codebase?

@adamcz and I were talking about this too... I think it's important we do as much of this as possible. I was the one not finding time to do it though, and I think Adam may do better :-)

- the existing training stuff is using internal tech, but AFAIK it's nothing LightGBM can't do (it trains on a single machine). So we should be able to open-source the training setup and actually use that.
- training data generation is harder to open, because it involves sampling a large diverse body of code and parsing lots of it. The core code that embeds clangd and extracts completion candidates should be very doable, so one could run over LLVM on one machine. The framework to run at a larger scale is coupled to internal infra though, and we're currently training on both public and non-public code.

================
Comment at: clang-tools-extra/clangd/quality/CompletionModel.cmake:10
+
+  set(model_json ${model}/forest.json)
+  set(model_features ${model}/features.json)
----------------
these vars are used only once, I'd suggest inlining them for readability

================
Comment at: clang-tools-extra/clangd/quality/CompletionModel.cmake:13
+  
+  set(output_dir ${CMAKE_BINARY_DIR}/generated/decision_forest)
+  set(header_file ${output_dir}/${filename}.h)
----------------
/generated/decision_forest seems redundant considering ${CMAKE_BINARY_DIR} is already the generated-files tree for the directory of the calling CMakeLists.

Can't we just use ${CMAKE_BINARY_DIR} directly and avoid the DECISION_FOREST_OUTPUT_DIR variable?

================
Comment at: clang-tools-extra/clangd/quality/CompletionModelCodegen.py:150
+    # Class definition.
+    code += f"class {cpp_class.name} {{\n"
+    code += "public:\n"
----------------
you can use triple-quoted f-strings, which I think would be more readable than blocks of "code +="

```
code += f"""class {cpp_class.name} {{
public:
  {"\n   ".join(setters)}

private:
  {"\n  ".join(class_members)}
"""
```

etc

may even do the whole thing in one go.

================
Comment at: clang-tools-extra/clangd/quality/CompletionModelCodegen.py:217
+
+    code = "\n".join(f'#include <{h}>' for h in angled_include) + "\n\n"
+    code += "\n".join(f'#include "{h}"' for h in quoted_include) + "\n\n"
----------------
again, making this one big triple-quoted f-string may be nicer, up to you

================
Comment at: clang-tools-extra/clangd/quality/README.md:4
+## Decision Forest
+A **decision forest** is a collection of many decision trees. A **decision tree** is a full binary tree where every non-leaf node has exactly **2** children. 
+
----------------
The second half of this sentence simply restates the first.

Maybe we can combine this with the second paragraph: "A decision tree is a full binary tree that provides a quality prediction for an input (code completion item). Internal nodes represent a binary decision based on the input data, and leaf nodes represent a prediction."

================
Comment at: clang-tools-extra/clangd/quality/README.md:8
+
+At every non-leaf node, we evaluate the condition present in the node.  The condition refers to exactly one **feature**. It uses the value of this attribute from the code completion item to evaluate the condition.
+Based on the condition, we move to its true child or the false child.
----------------
Nit: I think it's worth separating out defining features vs conditions.

e.g.
"An input (code completion candidate) is characterized as a set of **features**, such as the type of symbol or the number of existing references

At every non-leaf node, we evaluate the condition to decide whether to go left or right. The condition compares one feature of the input against a constant. It is either:
...".

================
Comment at: clang-tools-extra/clangd/quality/README.md:16
+A leaf node only contains the value **score**.
+Once we know the set of leaves (one from each decision tree), we add the **score** values from each of the leaves to get the final relevance score.
+
----------------
nit: rather than alternating between describing traversing all trees and one tree, I'd just say "To compute an overall quality score, we traverse each tree in this way and add up the scores".

================
Comment at: clang-tools-extra/clangd/quality/README.md:26
+  "name": "a_numerical_feature",
+  "type": "NUMBER"
+}
----------------
This is a little prone to confusion with C++ type.

Consider "kind" instead?

================
Comment at: clang-tools-extra/clangd/quality/README.md:34
+  "type": "ENUM",
+  "enum": "fully::qualified::enum",
+  "header": "path/to/HeaderDeclaringEnum.h"
----------------
this might be "type"?

================
Comment at: clang-tools-extra/clangd/quality/README.md:38
+```
+The field `enum` specifies the fully qualified name of the enum.
+
----------------
The max numeric value may not exceed 32 (is that right?)

================
Comment at: clang-tools-extra/clangd/quality/README.md:76
+## Code Generator for Inference
+The implementation of inference runtime is split across:
+- Build System (CMake)
----------------
nit: order should match order of the actual paragraphs.

This is short enough though that you might not want the mini-TOC here.

================
Comment at: clang-tools-extra/clangd/quality/README.md:98
+
+## Example
+### model/features.json
----------------
This seems like it might be a pain to maintain.
Maybe just include the json files and the public interface from DecicionForestRuntime.h?

================
Comment at: clang-tools-extra/clangd/quality/README.md:98
+
+## Example
+### model/features.json
----------------
sammccall wrote:
> This seems like it might be a pain to maintain.
> Maybe just include the json files and the public interface from DecicionForestRuntime.h?
may want to add the cmake invocation to generate the files.

Repository:
  rG LLVM Github Monorepo

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D83814/new/

https://reviews.llvm.org/D83814