[PATCH] D83814: [clangd] Add Random Forest runtime for code completion.

Sam McCall via Phabricator via cfe-commits cfe-commits at lists.llvm.org
Fri Jul 24 14:15:38 PDT 2020


sammccall added a comment.

In D83814#2166349 <https://reviews.llvm.org/D83814#2166349>, @usaxena95 wrote:

> This is still WIP and I will provide more details in the description once this is finalized.


This really should have high level documentation - I don't think Nathan will be the only one to ask these questions.
We need a description of what problem this solves, what the concepts are (trees, features, scores), inference, training, data sources, generated API.

Given that the implementation is necessarily split across CMake, generated code, generator, and data (but not a lot of hand-written C++) I think it's probably best documented in a README.md or so in the model/ dir.

Seems fine to do that in this patch, or ahead of time (before the implementation)... I wouldn't wait to do it after, it'll help with the review.

---

One question we haven't discussed is how workspace/symbol should work. (This uses the hand-tuned score function with some different logic). This is relevant to naming: it's not the "completion model" if it also has other features.



================
Comment at: clang-tools-extra/clangd/CMakeLists.txt:30
+  
+set(DF_COMPILER ${CMAKE_CURRENT_SOURCE_DIR}/CompletionModelCodegen.py)
+include(${CMAKE_CURRENT_SOURCE_DIR}/CompletionModel.cmake)
----------------
if you want the compiler script to be a parameter, make it an argument to the function rather than a magic variable. 

But really, I think the CompletionModel.cmake is tightly coupled to the python script, I think it can be hardcoded there.


================
Comment at: clang-tools-extra/clangd/CompletionModel.cmake:1
+# Run the Completion Model Codegenerator on the model in the 
+# ${model} directory.
----------------
I think there's some confusion in the naming.

You've got the code split into two locations: the generic generator and the specific code completion model.

However the directory named just "model" contains the specific stuff, and the generic parts are named "completionmodel!".

I'd suggest either:
 - don't generalize, and put everything in clangd/quality or so
 - split into clangd/quality/ and clangd/forest/ for the specific/generic parts


================
Comment at: clang-tools-extra/clangd/CompletionModel.cmake:5
+# ${CMAKE_BINARY_DIR}/generated/decision_forest. The generated header
+# will define a C++ class called ${cpp_class} - which may be a
+# namespace-qualified class name.
----------------
what does the class do?


================
Comment at: clang-tools-extra/clangd/CompletionModel.cmake:7
+# namespace-qualified class name.
+function(df_compile model fname cpp_class)  
+  set(model_json ${model}/forest.json)
----------------
df is cryptic.
decision_forest or gen_decision_forest?


================
Comment at: clang-tools-extra/clangd/CompletionModel.cmake:17
+    COMMAND "${Python3_EXECUTABLE}" ${DF_COMPILER} 
+      --model ${model}
+      --output_dir ${output_dir}
----------------
I'd suggest passing the component filenames explicitly here since you're computing them anyway


================
Comment at: clang-tools-extra/clangd/CompletionModel.cmake:19
+      --output_dir ${output_dir}
+      --fname ${fname}
+      --cpp_class ${cpp_class}
----------------
fname -> filename


================
Comment at: clang-tools-extra/clangd/CompletionModel.cmake:29
+    GENERATED 1
+    COMPILE_FLAGS -Wno-unused-label)
+
----------------
this needs to be guarded based on the compiler - other compilers use different flags

I'd suggest just -Wno-usuned


================
Comment at: clang-tools-extra/clangd/CompletionModel.cmake:31
+
+  set(GENERATED_CC ${df_cpp} PARENT_SCOPE)
+  set(DF_INCLUDE ${output_dir} PARENT_SCOPE)  
----------------
It'd be nice to avoid passing data out by setting magic variables with generic names.

The caller is already passing in the filename they want, so they know what it is.


================
Comment at: clang-tools-extra/clangd/CompletionModelCodegen.py:1
+import argparse
+import json
----------------
this needs documentation throughout


================
Comment at: clang-tools-extra/clangd/CompletionModelCodegen.py:9
+
+class Feature:
+    class Type(Enum):
----------------
Hmm, do these classes really pay for themselves compared to just using the JSON-decoded data structures directly and writing functions?

e.g.

```
def setter(feature):
  if (feature['kind'] == KIND_CATEGORICAL)
    return "void Set{feature}(float V) {{ {feature} = OrderEncode(V); }}".format(feature['name'])
  ...
```


================
Comment at: clang-tools-extra/clangd/CompletionModelCodegen.py:10
+class Feature:
+    class Type(Enum):
+        NUMERICAL = 1
----------------
These labels seem a bit abstract, what do you think about "number"/"enum"?


================
Comment at: clang-tools-extra/clangd/CompletionModelCodegen.py:21
+        if self.type == Feature.Type.CATEGORICAL:
+            assert 'header' in feature_json, "Header not found in categorical feature."
+            assert 'enum' in feature_json, "Enum not found in categorical feature."
----------------
Assert failures print the expression, so no need to say "header not found" etc.
And in fact the next line will throw a reasonable exception in this case... could drop these


================
Comment at: clang-tools-extra/clangd/CompletionModelCodegen.py:28
+        if self.type == Feature.Type.NUMERICAL:
+            return "void Set{feature}(float V) {{ {feature} = OrderEncode(V); }}".format(
+                feature=self.name)
----------------
nit: `set{Feature}`, following LLVM style


================
Comment at: clang-tools-extra/clangd/CompletionModelCodegen.py:33
+                feature=self.name)
+        assert False, "Unhandled feature type."
+
----------------
raise instead?

I'm not sure we want this error handling to be turned off... assert is for programming errors
(applies to the rest of the usage of assert too)


================
Comment at: clang-tools-extra/clangd/CompletionModelCodegen.py:45
+    return list(
+        set(f.header for f in filter(
+            lambda f: f.type == Feature.Type.CATEGORICAL, features.values())))
----------------
 i think this is `set(f.header for f in features.values() if f.type == Feature.Type.CATEGORICAL)` with no need for lambda
but my python is rusty


================
Comment at: clang-tools-extra/clangd/CompletionModelCodegen.py:52
+        self.fname = fname
+        assert not cpp_class.startswith(
+            "::"), "Do not fully qualify class name."
----------------
why not assert this on self.ns so that "::Foo" will work fine?


================
Comment at: clang-tools-extra/clangd/CompletionModelCodegen.py:60
+    def ns_begin(self):
+        return ["namespace {ns} {{".format(ns=ns) for ns in self.ns]
+
----------------
join here rather than returning an array?


================
Comment at: clang-tools-extra/clangd/CompletionModelCodegen.py:67
+
+    def header_gaurd(self):
+        # Convert fname to Upper Snake case.
----------------
gaurd -> guard


================
Comment at: clang-tools-extra/clangd/CompletionModelCodegen.py:70
+        return "GENERATED_CODE_COMPLETION_MODEL_{}_H".format(
+            reduce(lambda x, y: x + ('_' if y.isupper() else '') + y,
+                   self.fname).upper())
----------------
`''.join('_' if x in '-' else x.upper() for x in self.fname)` ?


================
Comment at: clang-tools-extra/clangd/CompletionModelCodegen.py:74
+
+class Tree:
+    def __init__(self, json_tree, tree_num: int, node_num: int):
----------------
(again, not clear the classes pay for themselves in terms of complexity. We do a lot of parsing and validation, but it's not clear it's important)


================
Comment at: clang-tools-extra/clangd/CompletionModelCodegen.py:103
+
+    def codegen(self, features):
+        code = []
----------------
would be nice to structure to reduce the indentation inside here.
Also 'codegen' is a pretty generic name for this particular piece.

I'd consider 
```
def node(n):
  return {
    'boost': boost_node,
    'if_greater': if_greater_node,
    'if_member': if_member_node,
  }[n['operation']](n)
```

and then define each case separately. (That snippet does actually check errors!)


================
Comment at: clang-tools-extra/clangd/CompletionModelCodegen.py:129
+                "BIT({enum}::{member})".format(
+                    enum=features[self.feature].enum, member=member)
+                for member in self.members
----------------
rather than passing features around to get access to the enum type, what about adding `using {feature}_type = ...` to the top of the generated file, and using that here?


================
Comment at: clang-tools-extra/clangd/CompletionModelCodegen.py:282
+
+    with open(model_json) as model_file:
+        forest = read_decision_forest(json.load(model_file))
----------------
putting all the trees in a single huge json file seems like it makes them hard to inspect, did you consider one-per-file?


================
Comment at: clang-tools-extra/clangd/for-review-only/CompletionModel.cpp:28
+  float Score = 0;
+  tree_0:
+  t0_n0: if(E.ContextKind & (BIT(clang::CodeCompletionContext::Kind::CCC_DotMemberAccess)|BIT(clang::CodeCompletionContext::Kind::CCC_ArrowMemberAccess))) goto t0_n2;
----------------
nit: consistently either:
 - tree0, tree0_node1
 - tree_0, tree_0_node_1
 - t0, t0_n1
etc


================
Comment at: clang-tools-extra/clangd/for-review-only/CompletionModel.h:10
+// That is: a < b <==> orderEncode(a) < orderEncode(b).
+uint32_t OrderEncode(float F);
+
----------------
this shouldn't be part of the public interface. Can we make it private static in the model class?


Repository:
  rG LLVM Github Monorepo

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D83814/new/

https://reviews.llvm.org/D83814





More information about the cfe-commits mailing list