Design for model discrimination using symbolic regression

Most experimental design focusses on parameter precision, where the model structure is assumed known and fixed. But arguably finding the correct model structure is the part of the modelling process that takes the most effort. In this blog we will look at automating this process using symbolic regression, and to do this with gathering too much data.

The Julia packages that we will use:

using SymbolicRegression
using Symbolics
using Distributions
using Optimization, OptimizationBBO
using Plots
using Random; Random.seed!(12345)

We will try to discover the equation: $$y(x) = \exp(-x)\sin(2\pi x) + \cos(\frac{\pi}{2}x), \qquad 0 \leq x \leq 10,$$ automatically from data. Translated into Julia code:

y(x) = exp(-x)*sin(2π*x) + cos(π/2*x)
y(0.0)

1.0

As a baseline design let us gather 10 points randomly from the design space.

n_obs = 10
design_region = Uniform(0.0,10.0)
X = rand(design_region,n_obs)
Y = y.(X)
plot(0.0:0.1:10.0,y.(0.0:0.1:10.0),label="true model",lw=5,ls=:dash);
scatter!(X,Y,ms=5,label="data");
plot!(xlabel="x",ylabel="y", ylims=(-1.2,1.8));
plot!(tickfontsize=12, guidefontsize=14, legendfontsize=8, grid=false, dpi=600)

Now let us perform symbolic regression on this dataset. We will look for 10 model structures that fit the data.

options = SymbolicRegression.Options(
    unary_operators = (exp, sin, cos),
    binary_operators=(+, *, /, -),
    seed=123,
    deterministic=true,
    verbosity=false,
    save_to_file=false,
    defaults=v"0.24.5"
)
hall_of_fame = EquationSearch(X', Y, options=options, niterations=100, runtests=false, parallelism=:serial)
n_best_max = 10
#incase < 10 model structures were returned
n_best = min(length(hall_of_fame.members),n_best_max) 
best_models = sort(hall_of_fame.members,by=member->member.loss)[1:n_best]

10-element Vector{PopMember{Float64, Float64, Expression{Float64, Node{Float64}, @NamedTuple{operators::DynamicExpressions.OperatorEnumModule.OperatorEnum{Tuple{typeof(+), typeof(*), typeof(/), typeof(-)}, Tuple{typeof(exp), typeof(sin), typeof(cos)}}, variable_names::Vector{String}}}}}:
 PopMember(tree = ((cos(x1 / 0.556922741892079) / (((sin(x1) + exp(x1)) + x1) + exp(-0.18983334693465292))) + cos(x1 / 0.6366135607609045)), loss = 1.353844447287992e-7, score = 0.064000242074528)
 PopMember(tree = ((cos(x1 / 0.556922741892079) / ((x1 + exp(x1)) + 1.7500726412136087)) + (cos(x1 / 0.6366135607609045) + 0.0004135556561892)), loss = 2.945164517643709e-7, score = 0.05760053521668158)
 PopMember(tree = (cos(x1 / 0.6366314628584495) - ((((-2.0209446503113644 - x1) - (-4.128483985539092 / x1)) + x1) / exp(x1))), loss = 9.307105294996532e-7, score = 0.05440169147743159)
 PopMember(tree = (cos(x1 / 0.6366314628584495) - ((-2.0209446503113644 - (-4.128483985539092 / x1)) / exp(x1))), loss = 9.307105294996631e-7, score = 0.04160169459475451)
 PopMember(tree = (cos(x1 / 0.6366314628584495) - sin((-2.0209446503113644 - (-4.128483985539092 / x1)) / (exp(x1) - 0.021122856968733148))), loss = 9.369243827736577e-7, score = 0.051201704501225895)
 PopMember(tree = (cos(x1 / 0.6366314628584495) - sin((-2.0209446503113644 - (-4.128483985539092 / x1)) / exp(x1))), loss = 9.382499642509582e-7, score = 0.04480170661063845)
 PopMember(tree = (cos(x1 / 0.6366314628584495) - sin(sin((-2.0209446503113644 - (-4.128483985539092 / x1)) / exp(x1)))), loss = 9.583132968993814e-7, score = 0.04800174515322376)
 PopMember(tree = (sin(cos(x1 / 0.556922741892079) / (((x1 + exp(x1)) + x1) - -0.18983334693465292)) + cos(x1 / 0.6366135607609045)), loss = 1.2465540540794797e-6, score = 0.06080226681628824)
 PopMember(tree = (cos(x1 / 0.6366486132211249) - (cos(x1 + -0.6342960332363662) / exp(x1))), loss = 5.174645144914829e-6, score = 0.038409419223890455)
 PopMember(tree = (sin((x1 - -1.0267283293605511) / 0.6384117132960604) / cos(-0.7430014086131702 / x1)), loss = 1.2723455785931058e-5, score = 0.03522316454183701)

Note
I ordered the model structures purely by mean squared error loss of the fits. Symbolic regression usually also incorporates a punishment for complexity of the model. I did not yet find a good way to incorporate this in the experimental design workflow, but this is definitely something that should be looked at.

Now let us turn these symbolic expressions back into executable functions. Let us try it for the first suggested model structure:

@syms x
eqn = node_to_symbolic(best_models[1].tree, options,varMap=["x"])

(cos(x / 0.556922741892079) / (((sin(x) + exp(x)) + x) + 0.8270969607022577)) + cos(x / 0.6366135607609045)

f = build_function(eqn, x, expression=Val{false})
f.(X)

10-element Vector{Float64}:
  0.9917134120127088
 -0.9202812756848684
  0.541456165631766
  0.9838841433137335
  0.9985475538293273
 -0.27639066814560703
  0.206101568512253
  0.7390207485567821
 -0.8612497963096721
  0.0005192718672333284

Now we do it for all the others and plot them:

plot(0.0:0.1:10.0,y.(0.0:0.1:10.0),lw=5,label="true model",ls=:dash);
model_structures = Function[]
for i = 1:n_best
    eqn = node_to_symbolic(best_models[i].tree, options,varMap=["x"])
    fi = build_function(eqn, x, expression=Val{false})
    x_plot = Float64[]
    y_plot = Float64[]
    for x_try in 0.0:0.1:10.0
        try
            y_try = fi(x_try)
            append!(x_plot,x_try)
            append!(y_plot,y_try)
        catch
        end
    end
    plot!(x_plot, y_plot,label="model $i");
    push!(model_structures,fi)
end
scatter!(X,Y,ms=5,label="data",ls=:dash);
plot!(xlabel="x",ylabel="y", ylims=(-1.2,1.6));
plot!(tickfontsize=12, guidefontsize=14, legendfontsize=8, grid=false)

We see that none of the suggested model structures approximate the true model well in the area between 0 and 2.5, while between 2.5 and 10 the models agree. In this case it is thus probably a good idea to gather more data for small $x$. Can we formalize this in mathematical terms? We will do this by creating a variant of T-optimal designs. T-optimal designs are model discrimination designs, where design points are sought which maximize the distance between a model thought to be correct (T for true) and some other plausible alternative model structures. Though perhaps it is better to think of the “true” model as a null hypothesis model. Design points are chosen such that the alternative models predict different values than the “true” model at these points. If the “true” model is then not correct after all, it should be easily discernible from the data.

In our situation, we do not have a model structure which can serve as the “true” model. We will instead work with all pairwise distances between the plausible model structures suggested by symbolic regression. Collecting measurements where the model structures differ greatly in predictions, will cause atleast some of the model structures to become unlikely, causing new model structures to enter the top 10. We call this S-optimal, with S for Symbolics. $$N = \text{number of measurements}$$ $$M = \text{number of models}$$ $$f_i = \text{ith model structure}$$ $$x_k = \text{kth design point}$$

$$\max_x \frac{2}{M(M-1)}\sum_{i=1}^{N}\sum_{j=i+1}^{N} \max_{k=1 \text{ to } M}\set{(f_i(x_k) - f_j(x_k))^2}$$

Note
The average over the pairwise model comparisons could be replaced with the minimum. This would lead to a max-min-max strategy instead of a max-expected-max strategy. In my experiments this did not work well when two of the suggested model structures are very similar or identical. This often occurs because of terms like $sin(x-x)$ being present in symbolic regression. Punishing for complexity might remedy this.

Now let us apply this criterion to gather 3 new measurements:

function S_criterion(x,model_structures)
    n_structures = length(model_structures)
    n_obs = length(x)
    if length(model_structures) == 1
        # sometimes only a single model structure comes out of the equation search
        return 0.0
    end
    y = zeros(n_obs,n_structures)
    for i in 1:n_structures
        y[:,i] .= model_structures[i].(x)
    end
    squared_differences = Float64[]
    for i in 1:n_structures
        for j in i+1:n_structures
            push!(squared_differences, maximum([k for k in (y[:,i] .- y[:,j]).^2]))
        end
    end
    -mean(squared_differences) # minus sign to minimize instead of maximize
end
function S_objective(x_new,(x_old,model_structures))
    S_criterion([x_old;x_new],model_structures)
end
n_batch = 3
X_new_ini = rand(design_region,n_batch)
S_objective(X_new_ini,(X,model_structures))

-0.0020153204305020994

Note
Can this be reformulated as a differentiable optimization problem, using slack variables?

lb = fill(minimum(design_region),n_batch)
ub = fill(maximum(design_region),n_batch)
prob = OptimizationProblem(S_objective,X_new_ini,(X,model_structures),lb = lb, ub = ub)
X_new = solve(prob,BBO_adaptive_de_rand_1_bin_radiuslimited(),maxtime=10.0)

┌ Warning: Using arrays or dicts to store parameters of different types can hurt performance.
│ Consider using tuples instead.
└ @ SciMLBase C:\Users\arno\.julia\packages\SciMLBase\XzPx0\src\performance_warnings.jl:33

retcode: MaxTime
u: 3-element Vector{Float64}:
 8.749678023720595e-154
 1.5708411540103517
 1.1455117200197213

We see that the 3 new observations are indeed both smaller than 2.5. Let us plot this:

Y_new = y.(X_new)
plot(0.0:0.1:10.0,y.(0.0:0.1:10.0),lw=5,label="true model",ls=:dash);
for i = 1:n_best
    x_plot = Float64[]
    y_plot = Float64[]
    for x_try in 0.0:0.01:10.0
        try
            y_try = model_structures[i](x_try)
            append!(x_plot,x_try)
            append!(y_plot,y_try)
        catch
        end
    end
    plot!(x_plot, y_plot,label="model $i");
end
scatter!(X,Y,ms=5,label="data old");
scatter!(X_new,Y_new,ms=5,label="data new");
plot!(xlabel="x",ylabel="y", ylim=(-1.2,1.8));
plot!(tickfontsize=12, guidefontsize=14, legendfontsize=8, grid=false, dpi=600)

Now, we run symbolic regression on our combined dataset:

X = [X;X_new]
Y = [Y;Y_new]
hall_of_fame = EquationSearch(X', Y, options=options, niterations=100, runtests=false, parallelism=:serial)
n_best = min(length(hall_of_fame.members),n_best_max) 
best_models = sort(hall_of_fame.members,by=member->member.loss)[1:n_best]
plot(0.0:0.01:10.0,y.(0.0:0.01:10.0),lw=5,label="true model",ls=:dash);
model_structures = Function[]
for i = 1:n_best
    eqn = node_to_symbolic(best_models[i].tree, options,varMap=["x"])
    println(eqn)
    fi = build_function(eqn, x, expression=Val{false})
    x_plot = Float64[]
    y_plot = Float64[]
    for x_try in 0.0:0.01:10.0
        try
            y_try = fi(x_try)
            append!(x_plot,x_try)
            append!(y_plot,y_try)
        catch
        end
    end
    plot!(x_plot, y_plot,label="model $i");
    push!(model_structures,fi)
end
scatter!(X,Y,ms=5,label="data");
plot!(xlabel="x",ylabel="y", ylims=(-1.2,1.8));
plot!(tickfontsize=12, guidefontsize=14, legendfontsize=8, grid=false, dpi=600)

cos(1.5707963268004939 * x) - (sin(x * -6.283185344000339) / (exp(x) - (-3.6820752424157317e-8 * (exp(x) + 1.1011581435364508))))
cos(1.5707963268004939 * x) - (sin(-6.283185344000339 * x) / (exp(x) - (-3.6820752424157317e-8 * exp(x))))
cos(1.5707963268004939 * x) - (sin(x * -6.283185344000339) / (exp(x) - ((x + 1.1011581435364508) * -3.6820752424157317e-8)))
cos(x * 1.5707963268004939) - (sin(x * -6.283185344000339) / ((exp(x) - 1.5752699418768323) + 1.5752701705815984))
cos(1.5707963268004939 * x) - (sin(x * -6.283185344000339) / (exp(x) - (-3.6820752424157317e-8 * x)))
cos(1.5707963268004939 * x) - (sin(x * -6.283185344000339) / (exp(x) - -3.6820752424157317e-8))
cos(x * 1.5707963268004939) - (sin(-6.283185344000339 * x) / exp(x))
cos((x - -2.973035511273684e-6) * 1.5707957537647865) - (sin(x * -6.2831665710382785) / exp(x))
cos(x * 1.5707791141382517) - sin(sin(-6.286309274467023 * x) / exp(x))
cos(x * 1.5718434218299882) - (-0.030041072681216516 / (1.2656975994705448 - x))

Et voilà, we found the correct model structure, with only 3 new observations!

Note
In fact we found it multiple times, with expressions like $sin(x-x)$. Again, punishing for needless complexity would be of added value here.