[PATCH] D59820: [llvm-exegesis] Introduce a 'naive' clustering algorithm (PR40880)

Wed Mar 27 03:31:51 PDT 2019

lebedev.ri added inline comments.

================
Comment at: tools/llvm-exegesis/lib/Clustering.cpp:57
+// Given a set of points, checks that all the points are neighbours.
+bool InstructionBenchmarkClustering::areAllNeighbours(
+    ArrayRef<size_t> Pts) const {
----------------
courbet wrote:
> lebedev.ri wrote:
> > courbet wrote:
> > > courbet wrote:
> > > > This is O(N^2).  You can do it in O(N): compute the cluster centroid (O(N)), then compute distance from each point to centroid (O(N)).
> > > > 
> > > > This relies on the fact that if there exist `p` and `q` such that `d(p,q) > e`, then either `d(p, centroid) > e/2` or `d(q, centroid) > e/2`.
> > > > 
> > > > Proof (ad absurdum):
> > > > Assume both `d(p, centroid) <= e/2` and `d(q, centroid) <= e/2`. Then:
> > > > 
> > > > ```
> > > > d(p, centroid)  + d(q, centroid) <= e
> > > > ```
> > > > 
> > > > By symmetry:
> > > > 
> > > > ```
> > > > d(p, centroid)  + d(centroid, q) <= e
> > > > ```
> > > > 
> > > > By [[ https://en.wikipedia.org/wiki/Triangle_inequality#Metric_space | triangle inequality ]]:
> > > > 
> > > > 
> > > > ```
> > > > d(p, q) <= d(p, centroid) + d(centroid, q) <= e
> > > > ```
> > > > 
> > > > 
> > > > 
> > > Oops, I just realized that I proved the opposite direction of what I wanted. Two options:
> > > 
> > >   - We consider that this criterion is as good as the other one to decide that we should reject the cluster, or
> > >   - We take another approach such as computing the bounding box of the cluster (O(N)), then compare its diagonal (distance between the two extremal points, O(1)) to the rejection threshold.
> > > 
> > > 
> > This //appears// to be sufficient.
> > 
> > Though, I think we can replace `getAsPoint()` with `getMinPoint()`+`getMaxPoint()`,
> > and just compare that these `Pmin` and `Pmax` are neighbours up to `AnalysisClusteringEpsilon`? (not halved!)
> > Maybe that is even less controversial?
> Sounds perfect to me.
Hm, I should have specified: I mostly considered only the current 1D case.
I did not think about the situation with more dimensions.

I **believe** that for the current 1D situation, the original brute-force approach,
this `O(2*N)` approach, and the `O(N+1)` approach, are all identical.
I'm not sure about the 2D case, especially because it is presently theoretical and I can't test it.
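
To make the 1D comparison concrete, here is a minimal sketch of the three checks. The function names are hypothetical and this is not the code from `Clustering.cpp`; it assumes plain `double` points with `d(a, b) = |a - b|`, and it assumes the centroid variant compares against `Epsilon / 2` as in the quoted suggestion.

```cpp
#include <algorithm>
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical helpers for illustration only; 1D points, d(a, b) = |a - b|.

// Brute force: every pair must be within Epsilon. O(N^2).
bool allNeighboursBruteForce(const std::vector<double> &Pts, double Epsilon) {
  for (std::size_t I = 0; I < Pts.size(); ++I)
    for (std::size_t J = I + 1; J < Pts.size(); ++J)
      if (std::fabs(Pts[I] - Pts[J]) > Epsilon)
        return false;
  return true;
}

// Min/max: only the two extremal points are compared, against the full
// (not halved) Epsilon. O(N).
bool allNeighboursMinMax(const std::vector<double> &Pts, double Epsilon) {
  if (Pts.empty())
    return true;
  const auto MinMax = std::minmax_element(Pts.begin(), Pts.end());
  return *MinMax.second - *MinMax.first <= Epsilon;
}

// Centroid: every point must be within Epsilon / 2 of the mean. O(N).
// Assumption: halved threshold, as in the quoted suggestion.
bool allNeighboursCentroid(const std::vector<double> &Pts, double Epsilon) {
  if (Pts.empty())
    return true;
  double Sum = 0.0;
  for (double P : Pts)
    Sum += P;
  const double Centroid = Sum / static_cast<double>(Pts.size());
  for (double P : Pts)
    if (std::fabs(P - Centroid) > Epsilon / 2)
      return false;
  return true;
}
```

In this sketch the brute-force and min/max checks agree on every 1D input, since the largest pairwise distance is exactly `max - min`. The halved-threshold centroid check only guarantees one direction: for the points `{0, 0, 1}` with `Epsilon = 1` it rejects a cluster that the other two accept, which is consistent with the quoted note that the proof covers the opposite direction.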

Thinking about it more, I'm not fully convinced that the Pmin/Pmax solution would be correct for 2D.

I think we should keep this at least for now, unless you don't believe that it is correct?
I don't expect it to be the performance bottleneck.
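
The 2D doubt can be illustrated with a small sketch. It assumes Euclidean distance and a hypothetical `Point2D` type; neither the type nor the function names come from the patch. `Pmin` and `Pmax` are taken to be the component-wise minimum and maximum, i.e. two opposite corners of the bounding box.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Hypothetical 2D point, for illustration only.
struct Point2D {
  double X;
  double Y;
};

// Squared Euclidean distance (assumed metric).
double distanceSquared(const Point2D &P, const Point2D &Q) {
  const double DX = P.X - Q.X;
  const double DY = P.Y - Q.Y;
  return DX * DX + DY * DY;
}

// Brute force: every pair must be within Epsilon. O(N^2).
bool allPairsWithinEpsilon2D(const std::vector<Point2D> &Pts, double Epsilon) {
  for (std::size_t I = 0; I < Pts.size(); ++I)
    for (std::size_t J = I + 1; J < Pts.size(); ++J)
      if (distanceSquared(Pts[I], Pts[J]) > Epsilon * Epsilon)
        return false;
  return true;
}

// Pmin/Pmax: compare the bounding-box diagonal against Epsilon. O(N).
bool boundingBoxWithinEpsilon2D(const std::vector<Point2D> &Pts,
                                double Epsilon) {
  if (Pts.empty())
    return true;
  Point2D Min = Pts.front();
  Point2D Max = Pts.front();
  for (const Point2D &P : Pts) {
    Min.X = std::min(Min.X, P.X);
    Min.Y = std::min(Min.Y, P.Y);
    Max.X = std::max(Max.X, P.X);
    Max.Y = std::max(Max.Y, P.Y);
  }
  return distanceSquared(Min, Max) <= Epsilon * Epsilon;
}
```

For the four points `(0, 0.5)`, `(1, 0.5)`, `(0.5, 0)`, `(0.5, 1)` the largest pairwise distance is 1, while the bounding-box diagonal is `sqrt(2)`. With `Epsilon = 1.25` the brute-force check accepts the cluster and the bounding-box check rejects it. Under these assumptions the Pmin/Pmax check is therefore conservative in 2D (passing it implies all points are neighbours) but not identical to the brute-force check.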

Repository:
  rL LLVM

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D59820/new/

https://reviews.llvm.org/D59820