[PATCH] D59820: [llvm-exegesis] Introduce a 'naive' clustering algorithm (PR40880)

Wed Mar 27 07:30:23 PDT 2019

courbet added inline comments.

================
Comment at: tools/llvm-exegesis/lib/Clustering.cpp:57
+// Given a set of points, checks that all the points are neighbours.
+bool InstructionBenchmarkClustering::areAllNeighbours(
+    ArrayRef<size_t> Pts) const {
----------------
lebedev.ri wrote:
> courbet wrote:
> > lebedev.ri wrote:
> > > courbet wrote:
> > > > courbet wrote:
> > > > > This is O(N^2).  You can do it in O(N): compute the cluster centroid (O(N)), then compute distance from each point to centroid (O(N)).
> > > > > 
> > > > > This relies on the fact that if there exists `p` and `q` such that `d(p,q) > e`, then either `d(p, centroid) > e/2` or `d(q, centroid) > e/2.
> > > > > 
> > > > > Proof (ad absurdum):
> > > > > Assume both `d(p, centroid) <= e/2` and `d(q, centroid) <= e/2`. Then:
> > > > > 
> > > > > ```
> > > > > d(p, centroid)  + d(q, centroid) <= e
> > > > > ```
> > > > > 
> > > > > By symmetry:
> > > > > 
> > > > > ```
> > > > > d(p, centroid)  + d(centroid, q) <= e
> > > > > ```
> > > > > 
> > > > > By [[ https://en.wikipedia.org/wiki/Triangle_inequality#Metric_space | triangle inequality ]]:
> > > > > 
> > > > > 
> > > > > ```
> > > > > d (p, q) <= d(p, centroid)  + d(q, centroid) <= e
> > > > > ```
> > > > > 
> > > > > 
> > > > > 
> > > > Oops, I just realized that I proved the opposite direction of what I wanted. Two options:
> > > > 
> > > >   - We consider that this criterion is as good as the other one to decide that we should reject the cluster, or
> > > >   - We take another approach such as computing the bounding box of the cluster (O(N)), then compare its diagonal (distance between the two extremal points, O(1)) to the rejection threshold.
> > > > 
> > > > 
> > > This //appears// to be sufficient.
> > > 
> > > Though, i think we can replace `getAsPoint()` with `getMinPoint()`+`getMaxPoint()`,
> > > and just compare that these `Pmin` and `Pmax` are neighbours up to `AnalysisClusteringEpsilon`? (not halved!)
> > > Maybe that is even less controversial?
> > Sounds perfect to me.
> Hm, i should have specified, i mostly only considered the current 1D case.
> I did not think about the situation with more dimensions.
> 
> I **believe** that for the current 1D situation, the original brute-force approach,
> this `O(2*N)` approach, and the `O(N+1)` approach, are all identical.
> I'm not sure about 2D case, especially because it is presently theoretical, i can't test it.
> 
> Thinking about it more, i'm not fully convinced that the Pmin/Pmax solution would be correct for 2D.
> 
> I think we should keep this at least for now, unless don't believe that it is correct?
> I don't expect it to be the performance bottleneck.
> I believe that for the current 1D situation, the original brute-force approach, this O(2*N) approach, and the O(N+1) approach, are all identical.

If there are `D` dimensions and N points, the original approach would be O(D*N^2) (all pairs of points * one distance computation for each pair), the second one O(D*N) (all points * accumulate each coordinate + all points * distance computation), and the third one O(D*N + D) (all points * {min+max computation over each coordinate} + distance computation between min bound and max bound).

> I'm not sure about 2D case, especially because it is presently theoretical, i can't test it.

`-mode=uops` gives you a 9-dimentional vector (ports 0-7 + `NumMicroOps`)

> Thinking about it more, i'm not fully convinced that the Pmin/Pmax solution would be correct for 2D.

I don't think there is really a "correct" way - eventually this is just a heuristic to mark clusters as noise, anything reasonable works :)

Repository:
  rL LLVM

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D59820/new/

https://reviews.llvm.org/D59820