# import Pkg;
# Pkg.add(["SymbolicRegression", "MLJ", "SymbolicUtils", "Plots"])
What is it?
A linear regression finds the line that is “closest” to a dataset. In a similar manner, a symbolic regression is an algorithm that finds a combination of symbols minimizing the mean squared error on a given dataset. These symbols are binary operators like \(+\) and unary functions like \(cos\) or \(1/x\).
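To make this concrete, here is a tiny sketch (our own illustration, not something the package does for you) of the quantity that a symbolic regression tries to minimize for a single candidate expression:
# Toy illustration: the mean squared error of one candidate expression on a dataset.
xs = [-3:0.1:3;]
ys = @. -xs^2 + 1                        # data we want to explain
candidate(x) = 1 - x * x                 # a candidate built only from {+, -, *} and x
mse = sum((candidate.(xs) .- ys).^2) / length(xs)   # the search looks for the expression minimizing this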
Example 1
Let’s try to approximate the function \(f(x) = - x^2 + 1\) using the symbols \(+, -, *\) combined with the variable \(x\).
using SymbolicRegression, MLJ, SymbolicUtils
using Plots
x = [-3:0.1:3;]
y = @. - x^2 + 1;
scatter(x, y)
First we define a model
model = SRRegressor(
    binary_operators=[+, -, *],
    niterations=50,
    seed=1
);
(Note: the argument seed = 1 is needed to ensure that the result is the same when this Quarto document compiles; you don’t need it.)
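(As an aside, and assuming your version of SymbolicRegression.jl exposes it, SRRegressor accepts further keyword arguments, for example maxsize to cap the complexity of candidate expressions; a hypothetical variant not used in this post:)
# Hypothetical variant (assumes the maxsize keyword is available in your SymbolicRegression.jl version):
model_small = SRRegressor(
    binary_operators=[+, -, *],
    niterations=50,
    maxsize=10,   # allow at most 10 nodes per expression
    seed=1
);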
And then fit it to our dataset
X = reshape(x, (length(x), 1))
mach = machine(model, X, y)
fit!(mach)
We can see a report about the results:
r = report(mach);
r
(best_idx = 2,
equations = Expression{Float64, DynamicExpressions.NodeModule.Node{Float64}, @NamedTuple{operators::DynamicExpressions.OperatorEnumModule.OperatorEnum{Tuple{typeof(+), typeof(-), typeof(*)}, Tuple{}}, variable_names::Vector{String}}}[-2.100000021910389, 1.0 - (x1 * x1)],
equation_strings = ["-2.100000021910389", "1.0 - (x1 * x1)"],
losses = [7.681799999999997, 0.0],
complexities = [1, 5],
scores = [36.04365338911715, 9.010913347279288],)
This report contains the losses
r.losses
2-element Vector{Float64}:
7.681799999999997
0.0
the equations
r.equations
2-element Vector{Expression{Float64, DynamicExpressions.NodeModule.Node{Float64}, @NamedTuple{operators::DynamicExpressions.OperatorEnumModule.OperatorEnum{Tuple{typeof(+), typeof(-), typeof(*)}, Tuple{}}, variable_names::Vector{String}}}}:
-2.100000021910389
1.0 - (x1 * x1)
and the best of the functions found (i.e., the one with the smallest loss):
node_to_symbolic(r.equations[r.best_idx], model)
1.0 - (x1 * x1)
Here, we can read \(x_1\) as \(x\), because we only have one variable.
Notice that this expression simplifies to our original \(f\).
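Since node_to_symbolic returns a SymbolicUtils expression, we can also ask SymbolicUtils to do the simplification for us (a small sketch; the exact printed form may differ):
# node_to_symbolic gives a SymbolicUtils expression, so SymbolicUtils' simplify applies to it.
best_expr = node_to_symbolic(r.equations[r.best_idx], model)
simplify(best_expr)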
Example 2
Now let’s try a more interesting example. Take \(f(x) = x^2 + 2cos(x)^2\):
y = @. x^2 + 2cos(x)^2
scatter(x, y)
We again create a model and fit it, but now we allow more operations: besides the earlier binary operators, we also include the unary cos function:
model = SRRegressor(
    binary_operators = [+, -, *],
    unary_operators = [cos],
    niterations = 50,
    seed = 1
);
mach = machine(model, X, y)
fit!(mach)
and see the best equation:
r = report(mach)
node_to_symbolic(r.equations[r.best_idx], model)
(x1 * x1) + ((cos(cos(x1 * (x1 * -1.6091339136968086))) * 2.8155375166732943e-16) + (cos(x1 + x1) - -0.9999999999999998))
So, since the term multiplied by \(2.8 \times 10^{-16}\) is numerically zero, we effectively got
\[ x * x + cos(x + x) - (-1) = x^2 + cos(2x) + 1 \]
Since \(cos(2x) + 1 = 2cos^2(x)\), we retrieve the original function.
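If you want to convince yourself of that identity numerically, here is a quick check on our grid (our own addition):
# cos(2x) + 1 and 2cos(x)^2 agree up to floating-point error on the grid
maximum(abs.(@. (cos(2x) + 1) - 2cos(x)^2))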
Example 3
Even after adding some noise to the original dataset, the symbolic regression can still find a very good approximation.
Take \(f(x) = 0.3 * x^3 - x^2 + 2cos(x) + \epsilon(x)\), where \(\epsilon(x)\) is a random error uniformly distributed in \([0, 1]\):
x = [-5:0.1:5;]
X = reshape(x, (length(x), 1))
errors = rand(length(x))
y = @. 0.3*x^3 - x^2 + 2cos(x) + errors
scatter(x, y)
model = SRRegressor(
    binary_operators = [+, -, *],
    unary_operators = [cos],
    niterations = 60,
    seed = 1
);
mach = machine(model, X, y)
fit!(mach)
and see the best equation:
r = report(mach)
node_to_symbolic(r.equations[r.best_idx], model)
(((-1.0010142242423987 - (x1 * -0.300544547485655)) * (x1 * x1)) + 0.4912169009541896) + (cos(x1) * 1.9796250803644289)
We can plot the prediction and the original dataset to compare them:
y_pred = predict(mach, X)
scatter(x, y);
scatter!(x, y_pred, color = "red")
Not bad at all!
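To put a number on that impression (our own addition; it assumes x, X and mach are still in scope), we can compute the mean squared error between the prediction and the noise-free target:
y_clean = @. 0.3*x^3 - x^2 + 2cos(x)                  # the target without the added noise
sum((predict(mach, X) .- y_clean).^2) / length(y_clean)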
You can read more about this package in its documentation. If you have enough courage, read the original paper on arXiv!