1 Introduction

1.1 Why Julia?

In a world ruled by Python (and, in some areas, R), why use an exquisite and new language like Julia in this workshop? You can find some good reasons in the book Julia Data Science, but I will give some in the context of topology:

1.1.1 Speed

Julia is fast. You won’t need to use another language to make some expensive computation.

Let’s say you read about a new algorithm to calculate the Vietoris-Rips. If you are a R/Python user, your algorithm won’t be written in R/Python simply because R/Python is slow. Fast code in R/Python is written in C, C++, Fortran or Rust; this is called the two language problem. Julia solves this because with enought knowledge about the compiler and the language, you will get performance nearly as good as if it was written in C. Optimizing Julia code is not a trivial task, but is way easier than learning another language just to get good performance in some functions.

The famous deep learning package `torch` in Python has its core written mostly in C++. Python is just a “glue” interface.

The deep learning package `Flux.jl` is 100% written in Julia.

1.1.2 Use other languages

It is easy to use another language inside Julia.

Even though Julia is fast and has a robust ecosystem, Python and R have many more packages already good-to-go. You can use them easily with tools like PythonCall or RCall.

1.1.3 Mathematical syntax

Julia looks like mathematics and has an elegant syntax.

LaTeX symbols

Being able to mix LaTeX symbols with code can make the code way more readable.

You can find many nice examples on BeautifulAlgorithms.jl. Below are some common Julia code:

# calculate the intersection of two vectors/sets
[1, 2] ∩ [2, 3, 4]

1-element Vector{Int64}:
 2

# check if a value is in a vector/set
1 ∉ [2, 3]

true

# define a function in one line
f(r) = π*r^2

f(3)

28.274333882308138

# Euler's identity
ℯ^(im * π) + 1 |> round

0.0 + 0.0im

# calculating the pairwise-distance between points in a set
X = [1, 2, 3, 4]
d(x, y) = abs(x - y)
[d(xᵢ, xⱼ) for xᵢ ∈ X, xⱼ ∈ X]

4×4 Matrix{Int64}:
 0  1  2  3
 1  0  1  2
 2  1  0  1
 3  2  1  0

Broadcasting

Easily apply a function to all elements of a vector/set:

x = [1, 2, 3, 4]

# broadcast
sin.(x)

4-element Vector{Float64}:
  0.8414709848078965
  0.9092974268256817
  0.1411200080598672
 -0.7568024953079282

# mapping
map(sin, x)

4-element Vector{Float64}:
  0.8414709848078965
  0.9092974268256817
  0.1411200080598672
 -0.7568024953079282

# list comprehension
[sin(x_i) for x_i ∈ x]

4-element Vector{Float64}:
  0.8414709848078965
  0.9092974268256817
  0.1411200080598672
 -0.7568024953079282

Piping

You can compose functions in the reading order. Instead of writing:

sin(cos(1))

0.5143952585235492

you can “pipe” the functions as in:

1 |> cos |> sin

0.5143952585235492

Functional programming

Julia is a functional programming language with type hierarchy, which means that we have “categories” (ie. types) and “functors” (ie. functions) mapping between the types; moreover, its polimorphism means that the functions depend on the type of its arguments.

For example, suppose you want to define the norm of a vector:

x = [1, 2, 3]

# define the norm of a Vector of Numbers
norm(x::Vector{<:Number}) = x.^2 |> sum |> sqrt
norm(x)

3.7416573867739413

You can also define the norm of a function $f$ as the approximate integral of $|f|$ on the interval $[0, 1]$:

# norm on [0, 1]
norm(f::Function; step_size = 0.0001) = 
  [f(x) * step_size for x ∈ 0:step_size:1] .|> 
  abs |> sum

square = x -> x^2
  
norm(square)

0.33338333500000006

which is very close to the real definite integral.

PS: plotting is also as easy as:

using Plots;
plot(square, 0:0.1:1)

Why not define the norm of a text as the amount of characters?

norm(s::AbstractString) = length(s)
norm("Hello!")

1.2 Why TDA?

Topological Data Analysis is a eccentric field of mathematics that applies tools from topology and algebraic topology on the study of datasets. By “datasets”, we almost always mean “finite metric space”.

These tools can be divided in some broad categories:

1.2.1 Homological methods

Persistence homology calculates the “shape” (homology groups) of a object on different scales, and compact this information into objects called “barcodes”. This barcode can be vectorized and then inserted into machine learning models.

The shape of data can be summarised as a barcode.

1.2.2 Graph reduction methods

The Mapper algorithm (see Singh et al. 2007) is the discrete version of the Reeb graph. It captures the geometry of a dataset with respect to a given function (called “filter”) and produce a graph of its pre-images clustered. There is also the Ball Mapper variant (see Dłotko 2019), which does not need a filter function and resembles the Vietoris-Rips filtration. These methods are useful to get a glimpse of the geometry (flares, holes) and areas of interest. For example: if your metric space is a dataset of measurements of patients with diabets, the Mapper graph can represent the patients and the shape of this graph can give insights about the type of diabetes one has.

The Mapper graph of some 3d-shapes retain some of the original geometry.

1.2.3 Clustering methods

Algorithms like ToMATo (see Chazal et al. 2013) are clustering algorithms: we have a metric space as input, and return a (disjoint) partition of this dataset. Clustering is useful whenever we need to “group a set of objects in such a way that objects in the same group (called a cluster) are more similar to each other than to those in other groups (clusters)”¹.

ToMATo is a persistence-based clustering algorithm.

References

https://en.wikipedia.org/wiki/Cluster_analysis ↩︎

--- jupyter: julia-1.10 --- # Introduction ## Why Julia? In a world ruled by Python (and, in some areas, R), why use an exquisite and new language like [Julia](https://julialang.org/) in this workshop? You can find some good reasons in the book [Julia Data Science](https://juliadatascience.io/programmers), but I will give some in the context of topology: ### Speed ::: {.callout-tip title="" appearance="simple"} Julia is fast. You won't need to use another language to make some expensive computation. ::: Let's say you read about a new algorithm to calculate the Vietoris-Rips. If you are a R/Python user, *your algorithm won't be written in R/Python* simply because R/Python is slow. Fast code in R/Python is written in C, C++, Fortran or Rust; this is called the [two language problem](https://juliadatascience.io/julia_accomplish#sec:two_language). Julia solves this because with enought knowledge about the compiler and the language, you will get performance nearly as good as if it was written in C. Optimizing Julia code is not a trivial task, but is way easier than learning another language just to get good performance in some functions. ![The famous deep learning package [`torch`](https://github.com/pytorch/pytorch) in Python has its core written mostly in C++. Python is just a "glue" interface.](images/intro/torch.png) ![The deep learning package [`Flux.jl`](https://github.com/FluxML/Flux.jl) is 100% written in Julia.](images/intro/flux.png) ### Use other languages ::: {.callout-tip title="" appearance="simple"} It is easy to use another language inside Julia. ::: Even though Julia is fast and has a robust ecosystem, Python and R have many more packages already good-to-go. You can use them easily with tools like [PythonCall](https://github.com/JuliaPy/PythonCall.jl) or [RCall](https://github.com/JuliaInterop/RCall.jl). ### Mathematical syntax ::: {.callout-tip title="" appearance="simple"} Julia looks like mathematics and has an elegant syntax. ::: #### LaTeX symbols Being able to mix LaTeX symbols with code can make the code way more readable. You can find many nice examples on [BeautifulAlgorithms.jl](https://github.com/mossr/BeautifulAlgorithms.jl). Below are some common Julia code: ```{julia} # calculate the intersection of two vectors/sets [1, 2] ∩ [2, 3, 4] ``` ```{julia} # check if a value is in a vector/set 1 ∉ [2, 3] ``` ```{julia} # define a function in one line f(r) = π*r^2 f(3) ``` ```{julia} # Euler's identity ℯ^(im * π) + 1 |> round ``` ```{julia} # calculating the pairwise-distance between points in a set X = [1, 2, 3, 4] d(x, y) = abs(x - y) [d(xᵢ, xⱼ) for xᵢ ∈ X, xⱼ ∈ X] ``` #### Broadcasting Easily apply a function to all elements of a vector/set: ```{julia} x = [1, 2, 3, 4] # broadcast sin.(x) ``` ```{julia} # mapping map(sin, x) ``` ```{julia} # list comprehension [sin(x_i) for x_i ∈ x] ``` #### Piping You can compose functions in the reading order. Instead of writing: ```{julia} sin(cos(1)) ``` you can "pipe" the functions as in: ```{julia} 1 |> cos |> sin ``` #### Functional programming Julia is a [functional programming language](https://en.wikipedia.org/wiki/Functional_programming) with type hierarchy, which means that we have "categories" (ie. types) and "functors" (ie. functions) mapping between the types; moreover, its polimorphism means that the functions depend on the type of its arguments. For example, suppose you want to define the norm of a vector: ```{julia} x = [1, 2, 3] # define the norm of a Vector of Numbers norm(x::Vector{<:Number}) = x.^2 |> sum |> sqrt norm(x) ``` You can also define the norm of a function $f$ as the approximate integral of $|f|$ on the interval $[0, 1]$: ```{julia} # norm on [0, 1] norm(f::Function; step_size = 0.0001) = [f(x) * step_size for x ∈ 0:step_size:1] .|> abs |> sum square = x -> x^2 norm(square) ``` which is very close to the real definite integral. PS: plotting is also as easy as: ```{julia} using Plots; plot(square, 0:0.1:1) ``` Why not define the norm of a text as the amount of characters? ```{julia} norm(s::AbstractString) = length(s) norm("Hello!") ``` ## Why TDA? Topological Data Analysis is a eccentric field of mathematics that applies tools from topology and algebraic topology on the study of datasets. By "datasets", we almost always mean "finite metric space". These tools can be divided in some broad categories: ### Homological methods Persistence homology calculates the "shape" (homology groups) of a object on different scales, and compact this information into objects called "barcodes". This barcode can be vectorized and then inserted into machine learning models. ![The shape of data can be summarised as a barcode.](images/getting-started/ph.png) ### Graph reduction methods The Mapper algorithm [see @singh2007topological] is the discrete version of the [Reeb graph](https://en.wikipedia.org/wiki/Reeb_graph). It captures the geometry of a dataset with respect to a given function (called "filter") and produce a graph of its pre-images clustered. There is also the Ball Mapper variant [see @dlotko2019ball], which does not need a filter function and resembles the Vietoris-Rips filtration. These methods are useful to get a glimpse of the geometry (flares, holes) and areas of interest. For example: if your metric space is a dataset of measurements of patients with diabets, the Mapper graph can represent the patients and the shape of this graph can give insights about the type of diabetes one has. ![The Mapper graph of some 3d-shapes retain some of the original geometry.](images/intro/mapper.png) ### Clustering methods Algorithms like ToMATo [see @chazal2013persistence] are _clustering algorithms_: we have a metric space as input, and return a (disjoint) partition of this dataset. Clustering is useful whenever we need to "group a set of objects in such a way that objects in the same group (called a cluster) are more similar to each other than to those in other groups (clusters)"^[[https://en.wikipedia.org/wiki/Cluster_analysis](https://en.wikipedia.org/wiki/Cluster_analysis)]. ![ToMATo is a persistence-based clustering algorithm.](images/intro/tomato.png)