[PATCH] D78429: [clangd] Metric tracking through Tracer

Sun Apr 19 16:01:34 PDT 2020

sammccall added a comment.

Thanks for tackling this! Mostly API stuff, obviously :-)

================
Comment at: clang-tools-extra/clangd/Trace.h:35
+/// itself to active tracer on destruction.
+struct Metric {
+  enum Type {
----------------
Conceptually there's a difference between a metric (the thing being measured, such as "latency distribution by method") and the measurement (such as "code completion took 90ms this time").
The attributes such as name, type, permitted labels are associated with the metric - these are schema. Whereas the measurement is a tuple `(metric, actual label, value)` - it doesn't make sense to have measurements for the same metric with different types. And the metric is immutable while the measurement is transient.

For the API (to emit data) it seems natural to model this split using constexpr objects for the metrics, and methods for the measurement:
```
constexpr Metric IncomingLatencyMS("/clangd/latency", Distribution);
...
void handleCodeCompletion() {
  ...
  IncomingLatencyMS.record(90);
}
```
(IOW I think our favorite metrics framework got this right)

For the SPI (to consume data), things are a bit muddier. For I think we want to avoid having a protocol to explicitly register metrics, so there's an attraction to just reporting the flat `(name, type, label, value)` tuple. I also quite like the idea of reusing the type above and reporting this as `(const Metric&, label, value)` which makes extending Metric more natural.

================
Comment at: clang-tools-extra/clangd/Trace.h:36
+struct Metric {
+  enum Type {
+    /// Occurence of an event, e.g. cache access.
----------------
I think we're missing an important type: a value directly provided by the measurement, such as memory usage.

================
Comment at: clang-tools-extra/clangd/Trace.h:36
+struct Metric {
+  enum Type {
+    /// Occurence of an event, e.g. cache access.
----------------
sammccall wrote:
> I think we're missing an important type: a value directly provided by the measurement, such as memory usage.
This enum is about defining the relationship between the metric and the measurements, but also about the (related) nature of the metric itself.

The current names are measurement-centric, which is good for the former but not so much for the latter, and a bit confusing if we set this enum when initializing the *metric*. I'd consider having this be metric-centric and documenting the measurements instead. e.g.

```
enum MetricType {
  /// A number whose value is meaningful, and may vary over time.
  /// Each measurement replaces the current value.
  Value,

  /// An aggregate number whose rate of change over time is meaningful.
  /// Each measurement is an increment for the counter.
  Counter,

  /// A distribution of values with a meaningful mean and count.
  /// Each measured value is a sample for the distribution.
  /// The distribution is assumed not to vary, samples are aggregated over time.
  Distribution,
};
```

================
Comment at: clang-tools-extra/clangd/Trace.h:42
+  };
+  /// Uniquely identifies the metric.
+  const llvm::StringLiteral Name;
----------------
An example to show the naming scheme?
I'd suggest just "method_latency" or so for most things - we can add "foo/bar" or "foo.bar" for hierarchy if needed, but I don't think a leading slash or a "clangd" in the path add anything. 

================
Comment at: clang-tools-extra/clangd/Trace.h:46
+  /// of a cache access.
+  std::vector<std::string> Labels;
+  double Value = 0;
----------------
(Not clear whether this is the allowed set of labels for a metric, bag of tags for a measurement, tuple of coordinates for a measurement...)

What are the design goals around labels? Can we make them as simple as possible (or remove them) for now?

Take the example of cache hit/miss.
There are at least two ways to model this without labels:
 - `cache.hit` and `cache.miss` counters
 - `cache_hit` distribution: measure 1 for hit, 0 for miss. Mean is the hit rate, count is the number of attempts.

The latter seems to capture the structure pretty well, but doesn't generalize (what if there are three categories?). The former seems almost equivalent to labels though. 

Benefits of labels:
 - part of the measurement, therefore can be dynamic (like filenames)
 - explicit that it's meaningful to aggregate across labels (naming conventions are OK for this too for one dimension)
 - generalizes sensibly to multiple dimensions (schema/docs become important)
 - quoting/escaping is annoying for labels with funny characters (e.g. file paths)
 - can bridge to other systems without knowledge of metric structure/semantics (this seems like a bad idea)

The first benefit seems pretty strong actually. I would suggest only supporting a single label though - it's a way simpler interface, we avoid complicated metrics whose schema might need documentation and enforcement,  and we avoid any hint of memory allocation in the common case (just passing around a stringref that points to a literal, a filename we own, etc).

================
Comment at: clang-tools-extra/clangd/Trace.h:48
+  double Value = 0;
+  Type Ty;
+
----------------
nit: since you have an unscoped enum, give the good name to the member and the bad name to the enum (which users need rarely spell)

================
Comment at: clang-tools-extra/clangd/Trace.h:57
+/// Creates a context that will report its duration as a metric on destruction.
+Context latencyTrackingContext(llvm::StringRef OpName);
+
----------------
Events we measure the duration of seem so closely related to trace spans that I don't think we can provide both without making it clear how they should relate to each other.

How horrible do you think it would be to have a second `Metric* Measure = nullptr` in the `Span` constructor?

It's hopelessly coupling together these two things that could otherwise be split, but...

Repository:
  rG LLVM Github Monorepo

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D78429/new/

https://reviews.llvm.org/D78429