5  Point clouds and distances

“To see a World in a Grain of Sand
And a Heaven in a Wild Flower,
Hold Infinity in the palm of your hand
And Eternity in an hour.”
— William Blake, in “Auguries of Innocence”

Before we can apply topological methods to data, we need to understand what “data” looks like from a geometric perspective. In TDA, data typically comes as a point cloud — a finite set of points in some metric space. This chapter covers the essential tools for working with point clouds: loading and generating them, measuring distances, estimating density, and visualizing structure.

5.1 What is a point cloud?

Definition 5.1 A point cloud is a finite set \(X = \{x_1, \ldots, x_n\}\) where each \(x_i \in \mathbb{R}^d\) for some dimension \(d\). Once we choose a distance function \(d : X \times X \to \mathbb{R}\), the pair \((X, d)\) becomes a finite metric space.

Point clouds appear everywhere:

  • GPS coordinates in \(\mathbb{R}^2\);
  • 3D scans in \(\mathbb{R}^3\);
  • images viewed as points in \(\mathbb{R}^{784}\);
  • sensor measurements in \(\mathbb{R}^d\);
  • word embeddings in \(\mathbb{R}^{300}\) or higher.

Throughout this book, the conceptual convention is still the same as before: one point per column, one coordinate per row. In the local JuliaTDA stack, point clouds are represented by EuclideanSpace objects from MetricSpaces.jl. If your data already comes as a d \times n matrix, EuclideanSpace(M) converts it. When you want a matrix back for plotting or inspection, use as_matrix(X).

5.2 Generating point clouds

Synthetic datasets are useful because we know their geometry in advance. The MetricSpaces.Datasets module provides convenient generators for standard shapes:

using CairoMakie
using MetricSpaces
using MetricSpaces.Datasets

5.2.1 Circles and spheres

circle = sphere(200, dim = 2)
sphere3d = sphere(500, dim = 3)

circle_mat = as_matrix(circle)
sphere_mat = as_matrix(sphere3d)

fig = Figure(size = (800, 400))
ax1 = Axis(fig[1, 1], title = "Circle (200 points)", aspect = DataAspect())
scatter!(ax1, circle_mat[1, :], circle_mat[2, :], markersize = 5)

ax2 = Axis3(fig[1, 2], title = "Sphere (500 points)")
scatter!(ax2, sphere_mat[1, :], sphere_mat[2, :], sphere_mat[3, :], markersize = 3)
fig

5.2.2 Torus

torus_cloud = torus(1000)
torus_mat = as_matrix(torus_cloud)

fig = Figure()
ax = Axis3(fig[1, 1], title = "Torus (1000 points)")
scatter!(ax, torus_mat[1, :], torus_mat[2, :], torus_mat[3, :], markersize = 3)
fig

5.3 Loading real datasets

Many real datasets come in CSV or text files. The clustering datasets/ folder in this repository contains several FCPS benchmark datasets (Ultsch and Lötsch 2020).

using DelimitedFiles

hepta_raw = readdlm("clustering datasets/Hepta.csv", ',', Any)
hepta = EuclideanSpace(permutedims(Float64.(hepta_raw[2:end, 2:end])))
hepta_labels = Int.(hepta_raw[2:end, 1])
hepta_mat = as_matrix(hepta)

fig = Figure()
ax = Axis3(fig[1, 1], title = "Hepta dataset")
scatter!(ax, hepta_mat[1, :], hepta_mat[2, :], hepta_mat[3, :], color = hepta_labels, markersize = 8)
fig
target_raw = readdlm("clustering datasets/Target.csv", ',', Any)
target = EuclideanSpace(permutedims(Float64.(target_raw[2:end, 2:end])))
target_labels = Int.(target_raw[2:end, 1])
target_mat = as_matrix(target)

fig = Figure()
ax = Axis(fig[1, 1], title = "Target dataset", aspect = DataAspect())
scatter!(ax, target_mat[1, :], target_mat[2, :], color = target_labels, markersize = 5)
fig

5.4 Distance metrics

The choice of distance function profoundly affects any analysis. The same finite set of points can look very different depending on which notion of distance we put on it.

Definition 5.2 A metric on a set \(X\) is a function \(d : X \times X \to \mathbb{R}\) satisfying:

  1. \(d(x, y) \geq 0\) and \(d(x, y) = 0 \iff x = y\);
  2. \(d(x, y) = d(y, x)\);
  3. \(d(x, z) \leq d(x, y) + d(y, z)\).

MetricSpaces.jl exports several common distances:

x = [1.0, 2.0, 3.0]
y = [4.0, 5.0, 6.0]

println("Euclidean: ", dist_euclidean(x, y))
println("Cityblock: ", dist_cityblock(x, y))
println("Chebyshev: ", dist_chebyshev(x, y))
println("Cosine:    ", dist_cosine(x, y))
Euclidean: 5.196152422706632
Cityblock: 9.0
Chebyshev: 3.0
Cosine:    0.025368153802923787

For point clouds, we often compute the full pairwise distance matrix:

X = sphere(6, dim = 2)
D = pairwise_distance(X, X, dist_euclidean)
D
6×6 Matrix{Float64}:
 0.0       0.17935   0.808482  1.91401  0.746766  1.79574
 0.17935   0.0       0.641181  1.95832  0.910136  1.70954
 0.808482  0.641181  0.0       1.98517  1.43304   1.28653
 1.91401   1.95832   1.98517   0.0      1.55897   1.36356
 0.746766  0.910136  1.43304   1.55897  0.0       1.99464
 1.79574   1.70954   1.28653   1.36356  1.99464   0.0

The matrix \(D\) stores all pairwise distances \(D_{ij} = d(x_i, x_j)\). Up to isometry, this matrix contains all the geometric information about the point cloud.

5.5 Density estimation

Given a point cloud, a natural question is: where are the points concentrated? Density estimation assigns a “height” to each point and turns the cloud into a discrete landscape. MetricSpaces.jl provides a kernel density estimator for exactly that.

using Random

rng = MersenneTwister(42)
X = EuclideanSpace(
    hcat(
        randn(rng, 2, 300),
        0.4 .* randn(rng, 2, 100) .+ [3.0, 0.0],
    ),
)
Xmat = as_matrix(X)
ds = kde(X, bandwidth = 0.5)

fig = Figure(size = (800, 400))
ax1 = Axis(fig[1, 1], title = "Point cloud", aspect = DataAspect())
scatter!(ax1, Xmat[1, :], Xmat[2, :], markersize = 4)

ax2 = Axis(fig[1, 2], title = "Colored by density", aspect = DataAspect())
scatter!(ax2, Xmat[1, :], Xmat[2, :], color = ds, markersize = 4)
fig

The density of a point \(x\) is estimated by averaging a kernel over all distances to the rest of the dataset:

\[ \mathrm{dens}(x) = \frac{1}{|X|} \sum_{y \in X} K\!\left(\frac{d(x, y)}{h}\right), \]

where \(h\) is the bandwidth. Larger \(h\) produces smoother densities; smaller \(h\) reveals finer local structure. Density estimation is one of the basic ingredients behind the ToMATo clustering algorithm that we will study later.

5.6 Nearest neighbors and local neighborhoods

Another fundamental operation is finding nearby points. Point-cloud algorithms rarely use the entire distance matrix directly; they usually work with local neighborhoods.

Conceptually, two standard constructions are:

  • the \(\epsilon\)-neighborhood graph, where points are connected when they lie within distance \(\epsilon\) of each other;
  • the \(k\)-nearest-neighbor graph, where each point is connected to its \(k\) closest neighbors.

MetricSpaces.jl exposes those local queries directly:

X = sphere(80, dim = 2)
center = X[1]

ids = ball_ids(X, center, 0.45)
println("Points in the ball: ", ids)
println("Number of points in the ball: ", length(ids))
Points in the ball: [1, 4, 13, 23, 26, 45, 47, 49, 55, 56, 57, 78]
Number of points in the ball: 12

These local neighborhoods are the building blocks behind proximity graphs, Ball Mapper, Vietoris-Rips complexes, and density-based clustering.

5.7 Summary

In this chapter we learned to:

  • represent data as point clouds using EuclideanSpace;
  • generate synthetic shapes with MetricSpaces.Datasets;
  • load benchmark datasets from CSV files;
  • compute distances with MetricSpaces.jl;
  • estimate densities with kernel methods;
  • query local neighborhoods.

These are the geometric foundations we will use throughout the rest of the book. Next, we turn to clustering point clouds and see both what classical methods can do well and where topology becomes necessary.