9  ToMATo

Given a point cloud \(X\), how can topology help us discover clusters?

The ToMATo algorithm combines three ingredients:

Once those ingredients are in place, ToMATo climbs the density landscape, tracks how peaks merge, and uses persistence to decide which peaks are real clusters and which are just noise.

One good way to understand ToMATo is to pretend we are inventing it from scratch. We will start with a toy point cloud, turn it into a landscape, and then ask how to follow that landscape toward stable peaks.

9.1 Setup

using CairoMakie
using MetricSpaces
using TDAplots
using ToMATo
using Random

9.2 A point cloud with three peaks

We start with a synthetic dataset containing two large clusters and one smaller one:

rng = MersenneTwister(0)
X = EuclideanSpace(
    hcat(
        randn(rng, 2, 800),
        0.8 .* randn(rng, 2, 800) .+ [4.0, -4.0],
        0.3 .* randn(rng, 2, 100) .+ [-2.0, 2.0],
    ),
)
Xmat = as_matrix(X)

fig = Figure()
ax = Axis(fig[1, 1], title = "Three noisy clusters", aspect = DataAspect())
scatter!(ax, Xmat[1, :], Xmat[2, :], markersize = 4)
fig

We can already guess that there are three dense regions, but ToMATo will make that intuition precise.

9.3 Density

The local JuliaTDA stack provides a kernel density estimator in MetricSpaces.jl:

ds = kde(X, bandwidth = 0.5)

fig = Figure(size = (800, 400))
ax1 = Axis(fig[1, 1], title = "Point cloud", aspect = DataAspect())
scatter!(ax1, Xmat[1, :], Xmat[2, :], markersize = 4)

ax2 = Axis(fig[1, 2], title = "Colored by density", aspect = DataAspect())
scatter!(ax2, Xmat[1, :], Xmat[2, :], color = ds, markersize = 4)
fig

The corresponding density landscape looks like a collection of mountains:

fig = Figure()
ax = Axis3(fig[1, 1], title = "Density landscape")
scatter!(ax, Xmat[1, :], Xmat[2, :], ds, color = ds, markersize = 4)
fig

Informally, points near the top of a mountain are local density maxima. Those are the candidate cluster centers.

Conceptually, the density of a point is obtained by averaging a kernel over its distances to the rest of the dataset:

\[ \mathrm{dens}(x) = \frac{1}{|X|} \sum_{y \in X} K\!\left(\frac{d(x,y)}{h}\right), \]

where \(h\) is the bandwidth. Larger bandwidths smooth the landscape; smaller ones reveal finer local bumps.

9.4 The proximity graph

ToMATo does not work directly with all pairwise distances. Instead, it builds a graph whose vertices are the points of \(X\) and whose edges connect nearby points:

g = proximity_graph(X, 0.5; min_k_ball = 4, max_k_ball = 6, k_nn = 3)

tomato_graph_plot(X, g, ds)

This graph is built point-by-point:

  • first try to connect a point to the points in its \(\epsilon\)-ball;
  • if that ball is too small, fall back to a fixed number of nearest neighbors;
  • if the ball is too large, cap the number of edges.

Those three parameters control how local or global the resulting notion of neighborhood becomes.

At this point we can imagine a mountain-climbing algorithm:

  • if a point is already at a local maximum, it should seed a cluster;
  • otherwise, it should inherit the cluster of a higher-density neighbor.

That is the core idea behind ToMATo.

9.5 Too many peaks

If we run ToMATo with \(\tau = 0\), then every little bump in the density can survive as its own cluster:

clusters_raw, _ = tomato(X, g, ds, 0.0)

tomato_graph_plot(X, g, float.(clusters_raw))

This usually over-segments the data. Every tiny bump in the density can survive as its own cluster, even when it is obviously just a wrinkle on the side of a larger mountain.

9.6 Merging insignificant peaks

ToMATo fixes that by merging peaks whose lifetimes are too short. The parameter \(\tau\) sets the persistence threshold:

τ = 0.01
clusters, _ = tomato(X, g, ds, τ, max_cluster_height = τ)

metricspace_plot(X, color = string.(clusters), markersize = 4)

Now we recover the three clusters we expected. Topologically, what survived was not every local maximum, but every maximum whose persistence was large enough to count as meaningful.

9.7 Choosing \(\tau\)

The usual workflow is to run ToMATo once with \(\tau = \infty\) and inspect the birth-death pairs of the peaks:

_, births_and_deaths = tomato(X, g, ds, Inf)

tomato_persistence_plot(births_and_deaths)

Points far from the horizontal axis correspond to peaks that survive for a long time. Those are the peaks worth keeping. Small points near the axis are short-lived peaks that can be merged away.

If we increase \(\tau\) further, the small cluster eventually merges into one of the large ones:

τ = 0.04
clusters_merged, _ = tomato(X, g, ds, τ, max_cluster_height = τ)

metricspace_plot(X, color = string.(clusters_merged), markersize = 4)

There is no universal value of $`. The persistence plot helps you pick one that matches the scale of the structure you care about. In practice, this is one of ToMATo’s strengths: the algorithm tells you which peaks are stable before you commit to a final clustering.

9.8 The original algorithm

Here is the original schematic description from (Chazal et al. 2011):

The original ToMATo algorithm. Here, \(G\) is the proximity graph, \(\tilde{f}\) is the density vector, \(\tau\) is the persistence threshold, \(\mathbfcal{N}\) is the neighborhood of \(x\) with points higher than \(x\), and \(r\) stores the cluster of each point.

9.9 Summing up

ToMATo needs:

  • a metric space \(X\);
  • a density function \(f\) on \(X\);
  • a proximity graph \(g\) on \(X\);
  • a persistence threshold \(\tau\).

Different choices of density and neighborhood graph lead to different clustering behavior. In this chapter we used a Gaussian-style kernel density estimator and a mixed \(\epsilon\)-ball / k-nearest-neighbor graph, but other combinations are possible.

That flexibility is important: ToMATo is not just a single clustering rule, but a topological framework for combining geometry, density, and persistence.