This post describes the new invaiant introduced in the paper “The mergegram of a dendrogram and its stability” (arxiv:2007.11278) by Yury Elkin and Vitaliy Kurlin, which was peer-review and presented in a 25-min video at MFCS 2020 (Mathematical Foundations of Computer Science).
The mergegram is the new intermediate stable-under-noise invariant of a cloud of a points. The mergegram sits between (1) the 0-dimensional persistence of a distance-based filtration on the cloud, which is stable, but weak as an isometry invariant, and (2) the dendrogram of clustering, which is much stronger, but is combinatorially unstable under perturbations of given points. The picture below shows a dendrogram on the left and its mergegram on the right.
What is a single-linkage clustering?
A clustering is any splitting of a given dataset into subgroups or clusters so that data points within each cluster are more similar to each other to data points from other clusters.
There are many measures of similarity, hence thousands of clustering algorithms.
The single-linkage (SL) clustering is one of the simplest methods, which can be applied to any abstract data points given by pairwise distances between them. For a distance threshold s>0, two points a,b belong to the same single-linkage cluster, if they are joined by a finite sequence of points a=a1, a2, … , ak=b such that the distance between any successive points is at most 2s. Here the factor 2 is chosen only for convenience of further constructions. Instead of a distance at most 2s one can say that the closed balls of radius s centered at given points intersect.
For the scale s=0, each given point form its own 1-point cluster. For any large enough s (larger than the maximum distance between given points), all points are in the same single cluster.
What is a dendrogram of clustering?
The dendrogram of clustering visualizes how cluster evolve when a threshold grows. For the cloud of 10 points in the Euclidean plane, the SL clustering starts from 10 isolated points at threshold or scale s=0.
A denrogram is a tree whose edges represent evolving clusters and internal nodes represent mergers of clusters. Each vertical edge `spans a life’ of a cluster while a scale s is increasing. The ten vertical edges at the bottom of the dendrogram above correspond to 10 given points.
A dendrogram is well-defined not only for the single-linkage clustering, but for any hierarchical clustering algorithm. A clustering is called hierarchical if by increasing a threshold one can only merge clusters, but never split them.
Instability of a dendrogram under noise.
More than two clusters can simultaneously merge into one. The 10-point cloud in the left picture above has the dendrogram on the right, where two quadruples of points merge into two clusters at the scale of s≈1.1.
Slightly perturbing one of these points, one can transform any 4-to-1 merger into a sequence of simpler 3-to-1 or 2-to-1 mergers of clusters. Hence the combinatorial structure, e.g. the list of vertex degrees, of a dendrogram is unstable under perturbations of points.
All real life data are noisy, hence our data descriptors should tolerate at least small noise.
Moreover, the problem of quantifying a similarity requires that any slightly different data objects have close values of their descriptors.
What is the mergegram of a dendrogram?
The mergegram resolves the instability of a dendrogram by extracting its stable pairs (birth,death) or dots in the plane diagram below. This stability will be discussed in a future post.
The mergegram of a dendrogram consists of the pairs (birth,death) in a 1-1 correspondence with all clusters that live from a scale s=birth to a scale s=death. If k clusters have identical life spans from their birth to death, then the mergegram includes the pair (birth, death) with multiplicity k.
In the example below at the scale s=0.5, the points 9,10 merge into the cluster {9,10}, so the two rightmost vertical edges merge into one (shows as joined by a horizontal edge in the dendrogram). This merger contributes two pairs (0,0.5) to the mergregram.
The cluster {9,10} lives from its birth at the scale s=0.5 until its merger with the cluster {4,.6}(death at s=1.5), hence contributes the pair (0.5,1.5) to the mergregram on the right below.
Experts in Topological Data Analysis have already recognized that the mergegram of pairs (birth,death) looks similar to a persistence diagram. In the case of clustering, the corresponding 0-dimensional persistence has all births equal to 0 (points born at scale 0), hence only pairs of the form (0,death). The mergegram contains pairs with positive births. A future post will include a couple of clouds with identical persistence, but different mergegrams, see Q2 below.
Exercises on understanding the new concept of a mergegram.
- Q1. Find a cloud C of points in the real line whose SL dendrogram is unstable under perturbation of points. Draw the mergegram of C or list its pairs in the form (birth,death).
- Q2. Let a cloud A in the real line have ordered points a1<a2<…<an. Find another cloud B of ordered points b1<b2<…<bn such that (1) B is not related to A by a translation or reflection, (2) A has the same unordered set of distances ak+1-ak between successive points as B, (3) the mergegrams of \(A\) and \(B\) are different.
You could write a brief answer or feedback in your comment: reply.