Test Functions for Bayesian Optimisation
Bayesian optimisation is a powerful optimisation technique for black-box functions and processes with expensive evaluations. It is popular for hyperparameter tuning in machine learning, but has many real-world applications as well.
At the centre of Bayesian optimisation is the objective function that we are trying to optimise. The objective function is usually expensive to evaluate, so before we start using Bayesian optimisation on the actual problem, we want to make sure that our code and models are working. In this post, we discuss a set of functions that can help us test and gauge the efficacy of our models. Along with the discussions are implementations in R.
library(ggplot2)
library(magrittr)
set.seed(4444)
set_dim <- `dim<-`
Objective Functions in Bayesian Optimisation
The objective function or process, \(f(\mathbf{x})\), is a map from the input space \(\mathbf{x} \in \mathcal{X}\) to a scalar output \(y \in \mathbb{R}\):
\[f: \mathcal{X} \rightarrow \mathbb{R}\]
The true nature of this process is unknown and, for real world problems, might often be very difficult to traverse. A good set of test functions should challenge a model with different scenarios so we can gauge the strengths and weaknesses of our model.
An Example Model
In Bayesian optimisation, we model the objective function using a surrogate model, typically a Gaussian process (GP), which is cheap to evaluate and provides uncertainty estimates. Bayesian optimisation then uses an acquisition function to decide on the next point to evaluate in the search space.
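Concretely, for training inputs \(X\), observations \(\mathbf{y}\) with noise variance \(\sigma_n^2\), and prediction points \(X_*\), the GP posterior used throughout this post has the standard closed form
\[\boldsymbol{\mu}_* = K_*^\top\left(K + \sigma_n^2I\right)^{-1}\mathbf{y}, \qquad \Sigma_* = K_{**} - K_*^\top\left(K + \sigma_n^2I\right)^{-1}K_*\]
where \(K = k(X, X)\), \(K_* = k(X, X_*)\) and \(K_{**} = k(X_*, X_*)\) are kernel matrices. This is what the posterior function in the code below computes.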
To demonstrate the test functions we will implement a GP with an RBF kernel. For the acquisition function, we will use Expected Improvement (EI).
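With a Gaussian surrogate, EI has a closed form. For minimisation, with best observed value \(y_{\text{best}}\), posterior mean \(\mu(\mathbf{x})\), posterior standard deviation \(\sigma(\mathbf{x})\) and exploration parameter \(\xi\):
\[\mathrm{EI}(\mathbf{x}) = \left(y_{\text{best}} - \mu(\mathbf{x}) - \xi\right)\Phi(Z) + \sigma(\mathbf{x})\phi(Z), \qquad Z = \frac{y_{\text{best}} - \mu(\mathbf{x}) - \xi}{\sigma(\mathbf{x})}\]
where \(\Phi\) and \(\phi\) are the standard normal CDF and PDF, and \(\mathrm{EI}(\mathbf{x}) = 0\) wherever \(\sigma(\mathbf{x}) = 0\). This is the formula implemented in expected_improvement below.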
#' RBF Kernel
#'
#' @param X1 matrix of dimensions (n, d). Vectors are coerced to (1, d).
#' @param X2 matrix of dimensions (m, d). Vectors are coerced to (1, d).
#' @param l length scale
#' @param sigma_f scale parameter
#'
#' @return matrix of dimensions (n, m)
rbf_kernel <- function(X1, X2, l = 1.0, sigma_f = 1.0) {
  if (is.null(dim(X1))) dim(X1) <- c(1, length(X1))
  if (is.null(dim(X2))) dim(X2) <- c(1, length(X2))
  sqdist <- (-2 * (X1 %*% t(X2))) %>%
    add(rowSums(X1**2, dims = 1)) %>%
    sweep(2, rowSums(X2**2, dims = 1), `+`)
  sigma_f**2 * exp(-0.5 / l**2 * sqdist)
}
#' Get Parameters of the Posterior Gaussian Process
#'
#' @param kernel kernel function used for the Gaussian process
#' @param X_pred matrix (m, d) of prediction points
#' @param X_train matrix (n, d) of training points
#' @param y_train column vector (n, 1) of training observations
#' @param noise scalar of observation noise
#' @param ... named parameters for the kernel function
#'
#' @return list of mean (mu) and covariance (sigma) for the posterior Gaussian process
posterior <- function(kernel, X_pred, X_train, y_train, noise = 1e-8, ...) {
  if (is.null(dim(X_pred))) dim(X_pred) <- c(length(X_pred), 1)
  if (is.null(dim(X_train))) dim(X_train) <- c(length(X_train), 1)
  if (is.null(dim(y_train))) dim(y_train) <- c(length(y_train), 1)
  K <- kernel(X_train, X_train, ...) + noise**2 * diag(dim(X_train)[[1]])
  K_s <- kernel(X_train, X_pred, ...)
  K_ss <- kernel(X_pred, X_pred, ...) + 1e-8 * diag(dim(X_pred)[[1]])
  K_inv <- solve(K)
  mu <- (t(K_s) %*% K_inv) %*% y_train
  sigma <- K_ss - (t(K_s) %*% K_inv) %*% K_s
  list(mu = mu, sigma = sigma)
}
#' Gaussian Negative log-Likelihood of a Kernel
#'
#' @param kernel kernel function
#' @param X_train matrix (n, d) of training points
#' @param y_train column vector (n, 1) of training observations
#' @param noise scalar of observation noise
#'
#' @return function with kernel parameters as input and negative log likelihood
#' as output
nll <- function(kernel, X_train, y_train, noise) {
  function(params) {
    n <- dim(X_train)[[1]]
    K <- rlang::exec(kernel, X1 = X_train, X2 = X_train, !!!params)
    L <- chol(K + noise**2 * diag(n))
    a <- backsolve(r = L, x = forwardsolve(l = t(L), x = y_train))
    0.5 * t(y_train) %*% a + sum(log(diag(L))) + 0.5 * n * log(2 * pi)
  }
}
#' Gaussian Process Regression
#'
#' @param kernel kernel function
#' @param X_train matrix (n, d) of training points
#' @param y_train column vector (n, 1) of training observations
#' @param noise scalar of observation noise
#' @param ... parameters of the kernel function with initial guesses. Due to the
#' optimiser used, all parameters must be given and the order unfortunately
#' matters
#'
#' @return function that takes a matrix of prediction points as input and
#' returns the posterior predictive distribution for the output
gpr <- function(kernel, X_train, y_train, noise = 1e-8, ...) {
  kernel_nll <- nll(kernel, X_train, y_train, noise)
  param <- list(...)
  opt <- optim(par = rep(1, length(param)), fn = kernel_nll)
  opt_param <- opt$par
  function(X_pred) {
    post <- rlang::exec(
      posterior,
      kernel = kernel,
      X_pred = X_pred,
      X_train = X_train,
      y_train = y_train,
      noise = noise,
      !!!opt_param
    )
    list(
      mu = post$mu,
      sigma = diag(post$sigma),
      Sigma = post$sigma,
      parameters = set_names(opt_param, names(param))
    )
  }
}
#' Expected Improvement Acquisition Function for a Gaussian Surrogate
#'
#' @param mu vector of length m. Mean of a Gaussian process at m points.
#' @param sigma vector of length m. The diagonal of the covariance matrix of a
#' Gaussian process evaluated at m points.
#' @param y_best scalar. Best mean prediction so far on observed points
#' @param xi scalar, exploration/exploitation trade off
#' @param task one of "max" or "min", indicating the optimisation problem
#'
#' @return EI, vector of length m
expected_improvement <- function(mu, sigma, y_best, xi = 0.01, task = "min") {
  imp <- NULL
  if (task == "min") imp <- y_best - mu - xi
  if (task == "max") imp <- mu - y_best - xi
  if (is.null(imp)) stop('task must be "min" or "max"')
  Z <- imp / sigma
  ei <- imp * pnorm(Z) + sigma * dnorm(Z)
  ei[sigma == 0.0] <- 0.0
  ei
}
#' Plot of a Gaussian Process in One Dimension
#'
#' @param mu vector of length m. Mean of a Gaussian process at m points.
#' @param sigma vector of length m. The diagonal of the covariance matrix of a
#' Gaussian process evaluated at m points.
#' @param X_pred matrix of dimensions (m X 1) representing m prediction points
#' with one dimension.
#' @param X_train matrix of dimensions (n X 1) representing n training points
#' with one dimension
#' @param y_train vector of length n representing n observations at points
#' X_train
#' @param true_function function representing the objective function (in real
#' life, this function is unknown and cannot be plotted)
#'
#' @return ggplot2 plot
gp_1d_plot <- function(mu, sigma, X_pred, X_train, y_train, true_function) {
  tibble::tibble(
    m = mu,
    uncertainty = 1.96 * sqrt(sigma),
    upper = m + uncertainty,
    lower = m - uncertainty,
    x = X_pred,
    f = true_function(X_pred)
  ) %>%
    ggplot(aes(x = x)) +
    geom_line(aes(y = m, colour = "Mean")) +
    geom_ribbon(
      aes(ymin = lower, ymax = upper, fill = "95% interval"),
      alpha = 0.2
    ) +
    geom_point(
      data = tibble::tibble(x = X_train, y = y_train),
      aes(x = x, y = y, shape = "Training point"),
      colour = "#fb8500",
      size = 4
    ) +
    geom_line(mapping = aes(y = f, colour = "True function")) +
    scale_shape_manual(values = c("Training point" = "+")) +
    scale_fill_manual(values = c("95% interval" = "#219ebc")) +
    labs(shape = "") +
    theme_minimal() +
    labs(
      y = "y",
      x = "",
      colour = "",
      fill = ""
    ) +
    theme(panel.grid = element_blank(), axis.text.x = element_blank())
}
#' Plot of Acquisition Function with Surrogate in One Dimension
#'
#' @param X_pred matrix of dimensions (m X 1) representing m prediction points
#' one dimension.
#' @param acquisition_function vector of length m representing the acquisition
#' function evaluated at the m points of X_pred
#' @param uncertainty_plot the plot of a surrogate model in one dimension
#' @param xt1 scalar, the point, x, that optimises the acquisition function
#' @param label character, label for the acquisition function
#' @param title character, a title for the plot
#'
#' @return ggplot2 plot
acquisition_plot <- function(X_pred,
                             acquisition_function,
                             uncertainty_plot,
                             xt1,
                             label = "EI",
                             title = "") {
  p1 <- tibble::tibble(
    x = X_pred,
    a = acquisition_function
  ) %>%
    ggplot() +
    geom_line(aes(x = x, y = a, colour = label)) +
    geom_vline(xintercept = xt1, linetype = 2) +
    theme_minimal() +
    labs(x = "", y = label, colour = "") +
    theme(panel.grid = element_blank())
  p2 <- uncertainty_plot +
    geom_vline(xintercept = xt1, linetype = 2) +
    labs(title = title)
  aligned_plots <- cowplot::align_plots(p2, p1, align = "v")
  cowplot::plot_grid(aligned_plots[[1]], aligned_plots[[2]], ncol = 1)
}
gp_rbf_ei_1d <- function(X_train,
                         X_pred,
                         demo_function,
                         noise = 1e-8,
                         title = "") {
  y_train <- demo_function(X_train) + rnorm(nrow(X_train), 0, noise)
  gp <- gpr(
    kernel = rbf_kernel,
    X_train = X_train,
    y_train = y_train,
    noise = noise,
    l = 1,
    sigma_f = 1
  )
  post_pred <- gp(X_pred)
  mu <- post_pred$mu
  sigma <- post_pred$sigma
  ei <- expected_improvement(mu = mu, sigma = sigma, y_best = min(y_train))
  gp_plot <- gp_1d_plot(
    mu = mu,
    sigma = sigma,
    X_pred = X_pred,
    X_train = X_train,
    y_train = y_train,
    true_function = demo_function
  )
  acquisition_plot(
    X_pred = X_pred,
    acquisition_function = ei,
    uncertainty_plot = gp_plot,
    xt1 = X_pred[which.max(ei)],
    label = "EI",
    title = paste("Gaussian Process Surrogate", title)
  )
}
For each test function, we will condition the GP on a set of six training points. Here is an example applied to a simple function:
demo_objective_function <- function(x) sin(12 * x) * x + 0.5 * x^2
X_train <- matrix(c(0.02, 0.3, 0.55, 0.75, 0.8, 0.98), 6, 1)
noise <- 0.05
y_train <- demo_objective_function(X_train) + rnorm(6, 0, noise)
X_pred <- matrix(seq(0, 1, length.out = 100), 100, 1)
gp_rbf_ei_1d(
  X_train = X_train,
  X_pred = X_pred,
  demo_function = demo_objective_function,
  noise = noise
)
The vertical dashed line indicates the next point that would be evaluated in our hypothetical Bayesian optimisation scenario.
Test Functions with Continuous Dimensions
We start out with test functions that are defined for any number \(d\) of continuous dimensions.
When doing Bayesian optimisation, it is prudent to normalise the input dimensions such that the search space is confined to the unit hypercube:
\[\mathcal{X} = [0,1]^d\]
Most of the following functions are, however, not evaluated on the unit hypercube. So along with each function, we also define a scaling function that maps inputs from the unit hypercube to the function's usual evaluation range.
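Each scaling function is just an affine map from the unit interval to the function's evaluation range \([a, b]\), applied to every dimension:
\[x_i = a + u_i(b - a), \qquad u_i \in [0, 1]\]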
Before we get started, we define a nice surface plot to visualise each function.
surface_plot_2d <- function(fun, scale) {
  n <- 300
  x <- scale(seq(0, 1, length.out = n))
  y <- expand.grid(x, x) %>%
    fun() %>%
    set_dim(c(n, n))
  plotly::plot_ly(
    x = x,
    y = x,
    z = y,
    showscale = FALSE
  ) %>%
    plotly::add_surface(
      contours = list(
        z = list(
          show = TRUE,
          start = min(y),
          end = max(y),
          size = (max(y) - min(y)) / 15,
          color = "white",
          usecolormap = TRUE,
          highlightcolor = "#ff0000",
          project = list(z = TRUE)
        )
      )
    ) %>%
    plotly::layout(
      plot_bgcolor = "rgba(0,0,0,0)",
      paper_bgcolor = "rgba(0,0,0,0)",
      scene = list(camera = list(eye = list(x = 2.2, y = 1.8, z = 1.5)))
    )
}
Ackley
There are several Ackley test functions; one of them is defined as [1]:
\[ \begin{array}{cc} f(\mathbf{x}) = & -20\exp\left(-0.2\sqrt{\frac{1}{d}\sum_{i=1}^dx_i^2}\right) \\ & - \exp\left(\frac{1}{d}\sum_{i=1}^d \cos(2\pi x_i)\right) \\ & + 20 + \exp(1) \end{array} \]
where \(d\) is the number of dimensions. Sometimes the function is defined with adjustable parameters in place of the constants, but the constants used here appear often [2].
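For reference, a common parameterisation replaces the constants with adjustable values \(a\), \(b\) and \(c\):
\[f(\mathbf{x}) = -a\exp\left(-b\sqrt{\frac{1}{d}\sum_{i=1}^dx_i^2}\right) - \exp\left(\frac{1}{d}\sum_{i=1}^d\cos(cx_i)\right) + a + \exp(1)\]
with \(a = 20\), \(b = 0.2\) and \(c = 2\pi\) recovering the version used here.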
ackley <- function(X) {
  if (is.null(dim(X))) X <- set_dim(X, c(1, length(X)))
  d <- ncol(X)
  part1 <- -20 * exp(-0.2 * sqrt(1 / d * rowSums(X^2)))
  part2 <- -exp(1 / d * rowSums(cos(2 * pi * X)))
  part1 + part2 + 20 + exp(1)
}
\(\mathbf{x}\) is limited to \(-35 < x_i < 35\), though the range \(-5 < x_i < 5\) is plenty challenging. The global minimum of the Ackley function is \(f(\mathbf{x})=0\) at \(\mathbf{x}=(0,0,...,0)\).
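As a quick spot check of the implementation, evaluating it at the origin should return (numerically) zero:
ackley(matrix(0, nrow = 1, ncol = 2)) # approximately 0, up to floating point error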
ackley_scale <- function(X) X * 10 - 5
surface_plot_2d(ackley, ackley_scale)
When conditioning a GP with an RBF kernel on the 1D Ackley function, the global trend is fairly easily captured. However, the local periodic component is not captured. The use of a periodic kernel along with the RBF kernel would possibly remedy that, but it is not necessary when we are looking for the global minimum.
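To illustrate what such an addition could look like, here is a minimal sketch of an exponential-sine-squared (periodic) kernel written in the same style as rbf_kernel. The name periodic_kernel and the period parameter p are our own choices, and the sketch is not used elsewhere in this post; a composite kernel would simply be the sum of rbf_kernel and periodic_kernel.
# Sketch: exponential-sine-squared (periodic) kernel, following the conventions
# of rbf_kernel. It is based on Euclidean distances between the rows of X1 and X2.
periodic_kernel <- function(X1, X2, l = 1.0, p = 1.0, sigma_f = 1.0) {
  if (is.null(dim(X1))) dim(X1) <- c(1, length(X1))
  if (is.null(dim(X2))) dim(X2) <- c(1, length(X2))
  sqdist <- (-2 * (X1 %*% t(X2))) %>%
    add(rowSums(X1**2, dims = 1)) %>%
    sweep(2, rowSums(X2**2, dims = 1), `+`)
  dist <- sqrt(pmax(sqdist, 0))
  sigma_f**2 * exp(-2 * sin(pi * dist / p)**2 / l**2)
}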
gp_rbf_ei_1d(
X_train = ackley_scale(X_train),
X_pred = ackley_scale(X_pred),
demo_function = ackley,
noise = 0.4,
title = "on the 1D Ackley Function"
)
Michalewicz
The Michalewicz test function is defined as [1]:
\[f(\mathbf{x}) = -\sum_{i = 1}^d\sin(x_i)\sin^{2m}\left(\frac{ix_i^2}{\pi}\right)\]
where \(m\) is a constant commonly set to 10.
michalewicz <- function(X, m = 10) {
  i <- seq_len(dim(X)[[2]]) / pi
  s <- t(i * t(X**2))
  -rowSums(sin(X) * sin(s)**(2 * m))
}
The function is usually evaluated in the range \(x_i \in [0, \pi]\) for all dimensions.
The Michalewicz test function features large swaths of space with virtually no gradient, making it very hard to optimise unless a training point happens to be close to the global optimum.
michalewicz_scale <- function(X) X * pi
surface_plot_2d(michalewicz, michalewicz_scale)
The elusiveness of the global optimum is only exacerbated in higher dimensions. The location and value of the global minimum depends on the number of dimensions, but in two dimensions it is [3]:
\[f(x_1 = 2.20, x_2 = 1.57) \approx -1.80\]
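A quick spot check of the implementation against this value:
michalewicz(matrix(c(2.20, 1.57), nrow = 1)) # approximately -1.80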
It is difficult to capture the overall trend of the Michalewicz function with a GP, even in 1D. Unless a training point happens to be close to the optimum, there is very little information to go on. When looking at the uncertainties reported by the GP, this results in an almost uniform band along the main plane of the function.
gp_rbf_ei_1d(
X_train = michalewicz_scale(X_train),
X_pred = michalewicz_scale(X_pred),
demo_function = michalewicz,
noise = 0.05,
title = "on the 1D Michalewicz Function"
)
Rastrigin
The Rastrigin function is defined as [3]:
\[f(\mathbf{x}) = 10d + \sum_{i=1}^d\left(x_i^2-10\cos(2\pi x_i)\right)\]
rastrigin <- function(X) 10 * dim(X)[[2]] + rowSums(X^2 - 10 * cos(2 * pi * X))
The usual evaluation range is \(-5.12 < x_i < 5.12\).
The function is essentially a large bowl shape with a periodic element that causes the function to have many local minima. The global minimum of the Rastrigin function is \(f(\mathbf{x})=0\) at \(\mathbf{x}=(0,0,...,0)\).
rastrigin_scale <- function(X) X * 10.24 - 5.12
surface_plot_2d(rastrigin, rastrigin_scale)
As with the Ackley function, a GP with an RBF kernel cannot capture the periodic element of the test function, but it does capture the large-scale trend, which is sufficient to find the global optimum. Bayesian optimisation is hampered by the deep local minima, though.
gp_rbf_ei_1d(
X_train = rastrigin_scale(X_train),
X_pred = rastrigin_scale(X_pred),
demo_function = rastrigin,
noise = 4,
title = "on the 1D Rastrigin Function"
)
Styblinski-Tang
The Styblinski-Tang function is defined as [1]:
\[f(\mathbf{x}) = 0.5\sum_{i=1}^d\left(x_i^4-16x_i^2+5x_i\right)\]
styblinski_tang <- function(X) 0.5 * rowSums(X^4 - 16 * X^2 + 5 * X)
The usual evaluation range is \(-5 < x_i < 5\) for all dimensions and the global minimum is \(f(\mathbf{x}) = -39.17d\) for \(x_i = -2.903534\) in all dimensions.
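A quick spot check in two dimensions, where the minimum should be roughly \(2 \times -39.17 \approx -78.3\):
styblinski_tang(matrix(rep(-2.903534, 2), nrow = 1)) # approximately -78.3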
The function is a large bowl with multiple local minima that are close in absolute value to the global minimum.
styblinski_tang_scale <- function(X) X * 10 - 5
surface_plot_2d(styblinski_tang, styblinski_tang_scale)
A GP with an RBF kernel makes quick work of the Styblinski-Tang function, and we should be able to find the global minimum with just a few evaluations in 1D.
gp_rbf_ei_1d(
X_train = styblinski_tang_scale(X_train),
X_pred = styblinski_tang_scale(X_pred),
demo_function = styblinski_tang,
noise = 4,
title = "on the 1D Styblinski-Tang Function"
)
Note how the next evaluation would have yielded the global minimum.
Zakharov
The Zakharov function is defined as [1]:
\[f(\mathbf{x}) = \sum_{i=1}^d\left(x_i^2\right) + \left(\sum_{i=1}^d0.5ix_i\right)^2+ \left(\sum_{i=1}^d0.5ix_i\right)^4\]
zakharov <- function(X) {
  # sum over dimensions of 0.5 * i * x_i
  sum_ix <- rowSums(t(0.5 * seq_len(dim(X)[[2]]) * t(X)))
  rowSums(X^2) + sum_ix^2 + sum_ix^4
}
The usual evaluation range is \(-5 < x_i < 10\) for all dimensions and the global minimum is \(f(\mathbf{x}) = 0\) for \(x_i = 0\) in all dimensions.
The function is a bent plane that has a very small derivative close to the global minimum.
zakharov_scale <- function(X) X * 15 - 5
surface_plot_2d(zakharov, zakharov_scale)
Despite the relatively simple looks of the function, it is very difficult for our GP with an RBF kernel to find the global minimum, even in 1D. This is because the RBF kernel is stationary, and the Zakharov function has vastly different length scales for different areas of search space. Combining the RBF kernel with a non-stationary kernel could probably remedy that.
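To illustrate, one simple non-stationary choice is a dot-product (linear) kernel added to the RBF kernel. This is only a sketch: the names linear_kernel and rbf_plus_linear_kernel are our own, the parameters are not tuned, and it is not used elsewhere in this post.
# Sketch: a dot-product (linear) kernel, which is non-stationary
linear_kernel <- function(X1, X2, sigma_b = 1.0) {
  if (is.null(dim(X1))) dim(X1) <- c(1, length(X1))
  if (is.null(dim(X2))) dim(X2) <- c(1, length(X2))
  sigma_b**2 * (X1 %*% t(X2))
}
# A sum of kernels is still a valid kernel, so this could be passed to gpr()
# in place of rbf_kernel (with parameters l, sigma_f and sigma_b)
rbf_plus_linear_kernel <- function(X1, X2, l = 1.0, sigma_f = 1.0, sigma_b = 1.0) {
  rbf_kernel(X1, X2, l = l, sigma_f = sigma_f) + linear_kernel(X1, X2, sigma_b = sigma_b)
}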
gp_rbf_ei_1d(
X_train = zakharov_scale(X_train),
X_pred = zakharov_scale(X_pred),
demo_function = zakharov,
noise = 100,
title = "on the 1D Zakharov Function"
)
Test Functions in 2D
While it is rare to have a real world problem that features exactly two input dimensions, limiting ourselves to just two continuous inputs allows us to employ some quite difficult test functions.
Langermann
The Langermann function is defined as [2]:
\[f(\mathbf{x}) = \sum_{j=1}^5c_j\exp\left(-\frac{1}{\pi}\sum_{i=1}^d\left(x_i-A_{ji}\right)^2\right)\cos\left(\pi\sum_{i=1}^d\left(x_i-A_{ji}\right)^2\right)\]
where \(\mathbf{c} = [1, 2, 5, 2, 3]\) and
\[ \mathbf{A} = \left[\begin{array}{cc} 3 & 5 \\ 5 & 2 \\ 2 & 1 \\ 1 & 4 \\ 7 & 9 \end{array}\right] \]
langermann <- function(X) {
  d <- dim(X)[[2]]
  cvec <- c(1, 2, 5, 2, 3)
  A <- matrix(c(3, 5, 2, 1, 7, 5, 2, 1, 4, 9)[1:(d * 5)], nrow = 5)
  purrr::map(1:5, function(j) {
    xa <- rowSums(t(t(X) - A[j, ])^2)
    cvec[[j]] * exp((-1 / pi) * xa) * cos(pi * xa)
  }) %>%
    do.call(cbind, .) %>%
    rowSums()
}
The usual evaluation range is \(0 < x_i < 10\) for each dimension. Unlike the previous functions, the global minimum is negative and does not lie at the origin.
The function features a very diverse landscape with many local minima as well as areas with relatively small gradients.
langermann_scale <- function(X) X * 10
surface_plot_2d(langermann, langermann_scale)
It is very difficult for a GP with an RBF kernel to capture the structure of the Langermann function, even in 1D. Given a small budget for training points, the best we can hope for is a good local minimum.
gp_rbf_ei_1d(
X_train = langermann_scale(X_train),
X_pred = langermann_scale(X_pred),
demo_function = langermann,
noise = 0.1,
title = "on the 1D Langermann Function"
)
Shubert
The Shubert function is defined as [2]:
\[f(\mathbf{x}) = \left(\sum_{j=1}^5j\cos\left((j+1)x_1+j\right)\right)\left(\sum_{j=1}^5j\cos\left((j+1)x_2+j\right)\right)\]
The function is easily extended to any number \(d\) of dimensions [1] [3], and the implementation below reflects that. However, given the product across dimensions, the function values can grow quickly, which may cause numerical issues in higher dimensions.
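In its \(d\)-dimensional form, which is what the implementation computes, the function is a product over dimensions:
\[f(\mathbf{x}) = \prod_{i=1}^d\sum_{j=1}^5j\cos\left((j+1)x_i+j\right)\]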
shubert <- function(X) {
  purrr::map(1:5, ~ .x * cos((.x + 1) * X + .x)) %>%
    purrr::reduce(`+`) %>%
    apply(1, prod)
}
The usual range of evaluation is \(-10 < x_i < 10\) for all dimensions.
The function features multiple global minima at \(f(\mathbf{x}) = -186.73\). In 2D there are 18 of them in the evaluation range.
shubert_scale <- function(X) X * 20 - 10
surface_plot_2d(shubert, shubert_scale)
The Shubert function features periodic elements at different scales, which an RBF kernel cannot model. In 1D, Bayesian optimisation using a GP with an RBF kernel will probably still find a global minimum fairly quickly, however.
gp_rbf_ei_1d(
X_train = shubert_scale(X_train),
X_pred = shubert_scale(X_pred),
demo_function = shubert,
noise = 0.1,
title = "on the 1D Shubert Function"
)
References
License
The content of this project itself is licensed under the Creative Commons Attribution-ShareAlike 4.0 International license, and the underlying code is licensed under the GNU General Public License v3.0 license.