# import Pkg;
# Pkg.add(["SymbolicRegression", "MLJ", "SymbolicUtils", "Plots"])
What is it?
A linear regression finds the line that is “closest” to a dataset. In a similar manner, a symbolic regression is an algorithm that finds a combination of symbols minimizing the mean squared error on a given dataset. These symbols are binary operators like \(+\) and unary functions like \(cos\) or \(1/x\).
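To make this concrete, here is a tiny sketch (our own illustration, not something the package does for you) of the quantity that a symbolic regression tries to minimize for a single candidate expression:
# Toy illustration: the mean squared error of one candidate expression on a dataset.
xs = [-3:0.1:3;]
ys = @. -xs^2 + 1                        # data we want to explain
candidate(x) = 1 - x * x                 # a candidate built only from {+, -, *} and x
mse = sum((candidate.(xs) .- ys).^2) / length(xs)   # the search looks for the expression minimizing this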
Example 1
Let’s try to approximate the function \(f(x) = - x^2 + 1\) using the symbols \(+, -, *\) combined with the variable \(x\).
using SymbolicRegression, MLJ, SymbolicUtils
using Plots
x = [-3:0.1:3;]
y = @. - x^2 + 1;
scatter(x, y)
First we define a model
model = SRRegressor(
    binary_operators=[+, -, *],
    niterations=50,
    seed=1
);
(Note: the argument seed = 1 is needed to ensure that the result is the same when this Quarto document compiles; you don’t need it.)
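(As an aside, and assuming your version of SymbolicRegression.jl exposes it, SRRegressor accepts further keyword arguments, for example maxsize to cap the complexity of candidate expressions; a hypothetical variant not used in this post:)
# Hypothetical variant (assumes the maxsize keyword is available in your SymbolicRegression.jl version):
model_small = SRRegressor(
    binary_operators=[+, -, *],
    niterations=50,
    maxsize=10,   # allow at most 10 nodes per expression
    seed=1
);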
And then fit it to our dataset
X = reshape(x, (length(x), 1))
mach = machine(model, X, y)
fit!(mach)
We can see a report about the results:
r = report(mach);
r
(best_idx = 2,
equations = Expression{Float64, DynamicExpressions.NodeModule.Node{Float64}, @NamedTuple{operators::DynamicExpressions.OperatorEnumModule.OperatorEnum{Tuple{typeof(+), typeof(-), typeof(*)}, Tuple{}}, variable_names::Vector{String}}}[-2.100000021910389, 1.0 - (x1 * x1)],
equation_strings = ["-2.100000021910389", "1.0 - (x1 * x1)"],
losses = [7.681799999999997, 0.0],
complexities = [1, 5],
scores = [36.04365338911715, 9.010913347279288],)
This report contains the losses
r.losses
2-element Vector{Float64}:
7.681799999999997
0.0
the equations
r.equations
2-element Vector{Expression{Float64, DynamicExpressions.NodeModule.Node{Float64}, @NamedTuple{operators::DynamicExpressions.OperatorEnumModule.OperatorEnum{Tuple{typeof(+), typeof(-), typeof(*)}, Tuple{}}, variable_names::Vector{String}}}}:
-2.100000021910389
1.0 - (x1 * x1)
and the best of the functions found (i.e., the one with the smallest loss):
node_to_symbolic(r.equations[r.best_idx], model)
1.0 - (x1 * x1)
Here, we can read \(x_1\) as \(x\), because we only have one variable.
Notice that this expression simplifies to our original \(f\).
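Since node_to_symbolic returns a SymbolicUtils expression, we can also ask SymbolicUtils to do the simplification for us (a small sketch; the exact printed form may differ):
# node_to_symbolic gives a SymbolicUtils expression, so SymbolicUtils' simplify applies to it.
best_expr = node_to_symbolic(r.equations[r.best_idx], model)
simplify(best_expr)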
Example 2
Now let’s try a more interesting example. Take \(f(x) = x^2 + 2cos(x)^2\):
y = @. x^2 + 2cos(x)^2
scatter(x, y)
We again create a model and fit it, but now we allow more operations: besides the earlier binary operators, we also include the unary cos function:
model = SRRegressor(
    binary_operators = [+, -, *],
    unary_operators = [cos],
    niterations = 50,
    seed = 1
);
mach = machine(model, X, y)
fit!(mach)
and see the best equation:
r = report(mach)
node_to_symbolic(r.equations[r.best_idx], model)
(x1 * x1) + ((cos(cos(x1 * (x1 * -1.6091339136968086))) * 2.8155375166732943e-16) + (cos(x1 + x1) - -0.9999999999999998))
So, since the term multiplied by \(2.8 \times 10^{-16}\) is numerically zero, we effectively got
\[ x * x + cos(x + x) - (-1) = x^2 + cos(2x) + 1 \]
Since \(cos(2x) + 1 = 2cos^2(x)\), we retrieve the original function.
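If you want to convince yourself of that identity numerically, here is a quick check on our grid (our own addition):
# cos(2x) + 1 and 2cos(x)^2 agree up to floating-point error on the grid
maximum(abs.(@. (cos(2x) + 1) - 2cos(x)^2))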
Example 3
Even after adding some noise to the original dataset, the symbolic regression can still find a very good approximation.
Take \(f(x) = 0.3 * x^3 - x^2 + 2cos(x) + \epsilon(x)\), where \(\epsilon(x)\) is a random error uniformly distributed in \([0, 1]\):
x = [-5:0.1:5;]
X = reshape(x, (length(x), 1))
errors = rand(length(x))
y = @. 0.3*x^3 - x^2 + 2cos(x) + errors
scatter(x, y)
model = SRRegressor(
    binary_operators = [+, -, *],
    unary_operators = [cos],
    niterations = 60,
    seed = 1
);
mach = machine(model, X, y)
fit!(mach)
and see the best equation:
r = report(mach)
node_to_symbolic(r.equations[r.best_idx], model)
(((-1.0010142242423987 - (x1 * -0.300544547485655)) * (x1 * x1)) + 0.4912169009541896) + (cos(x1) * 1.9796250803644289)
We can plot the prediction and the original dataset to compare them:
y_pred = predict(mach, X)
scatter(x, y);
scatter!(x, y_pred, color = "red")
Not bad at all!
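To put a number on that impression (our own addition; it assumes x, X and mach are still in scope), we can compute the mean squared error between the prediction and the noise-free target:
y_clean = @. 0.3*x^3 - x^2 + 2cos(x)                  # the target without the added noise
sum((predict(mach, X) .- y_clean).^2) / length(y_clean)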
You can read more about this package in its documentation. If you have enough courage, read the original paper on arXiv!