This is the fourth part in my Intro to Bayesian Optimization series:

  1. Optimization Intuitions
  2. The Bayesian Optimization Framework
  3. Intro to Bayesian Statistics
  4. Gaussian Processes

So far we have described the Bayesian Optimization framework and built a basic understanding of Bayesian statistics. In BO, we have no prior knowledge about the shape of the function, so the model we use needs to be able to describe arbitrary continuous functions.

Gaussian Process regression is a nonparametric method to model arbitrary continuous functions given limited data. Sounds perfect for our situation. Most literature and implementations of BO use GPs for this reason. However, the usefulness of GPs is wide-ranging, and the study of GPs is a large research area in its own right.

A change of perspective

Let's say we have 5 data points, each with an $x$ and a $y$ value. Let's ignore the $x$ values for now and just think of our vector of $y$ values $(y_1, y_2, y_3, y_4, y_5)$.

Now, we could think of it as 5 points in 1-dimensional space, but if we tilt our head, we could also interpret this as 1 point in 5-dimensional space.

What if all the data points in our infinite sample space are just a single point in infinite-dimensional space, with every possible $x$ point corresponding to its own dimension?

The Gaussian Process

Gaussian Process regression takes this perspective, then defines a Gaussian distribution over this infinite-dimensional space, which we call our function space. A point in function space now corresponds to a realization of the data (a function) in our original data space. We pick a Gaussian Process as the prior distribution over the function space.

The specific definition is that a Gaussian Process is any (possibly infinite) collection of random variables such that any finite collection of them has a joint Gaussian distribution. So here we define a GP over infinite-dimensional space, and any finite subset (e.g. our data points) is also Gaussian in its joint distribution.

Note that we are not saying that our $y$ values are Gaussian with respect to the $x$ values. We are saying that the $y$ values are jointly drawn from a Gaussian distribution with respect to each other. This will become clearer when we visualize the prior.

Kernels

Now think about what it means for a function to be continuous: an arbitrarily small change in our $x$ variables results in a correspondingly small change in our $y$ variable, i.e. a small change in $x$ is a small change in $y$. We can encode this in our new space by saying that if the distance between two points is small, the covariance between the corresponding dimensions of the function space is large, because the function values should be similar. We call this mapping from the distance between two points to a covariance a kernel, $k(x, x')$, where $x$ and $x'$ are two points.

A common kernel is the radial basis function (RBF), aka the exponentiated quadratic kernel:

$$k(x, x') = \exp\left(-\frac{\lVert x - x' \rVert^2}{2}\right)$$

When $x$ and $x'$ are close, $k(x, x')$ goes to 1; as they move far apart, $k(x, x')$ goes to 0. Note this works for any number of dimensions of explanatory variables. In practice we include two hyperparameters in the RBF for tuning: $\sigma^2$ controls the amplitude and $\ell$ controls how fast the covariance drops off:

$$k(x, x') = \sigma^2 \exp\left(-\frac{\lVert x - x' \rVert^2}{2\ell^2}\right)$$
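As a rough sketch of what this looks like in code (Python/NumPy for one-dimensional inputs; the names `rbf_kernel`, `sigma`, and `lengthscale` are my own, standing in for $k$, $\sigma$, and $\ell$):

```python
import numpy as np

def rbf_kernel(x1, x2, sigma=1.0, lengthscale=1.0):
    """Exponentiated quadratic kernel for 1-D inputs.

    Returns the (len(x1), len(x2)) covariance matrix with entries
    sigma^2 * exp(-(x - x')^2 / (2 * lengthscale^2)).
    """
    x1, x2 = np.asarray(x1, dtype=float), np.asarray(x2, dtype=float)
    sqdist = (x1[:, None] - x2[None, :]) ** 2   # pairwise squared distances
    return sigma**2 * np.exp(-0.5 * sqdist / lengthscale**2)
```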

There are many other reasonable choices of kernel you can use depending on what you know about the function you are approximating (e.g. if your function is periodic and you want to encode that).

So if we have some response variable vector $\mathbf{f}$ and some explanatory variables $\mathbf{x}$, then our prior distribution for $\mathbf{f}$ is the Gaussian Process:

$$f(x) \sim \mathcal{GP}\left(0,\; k(x, x')\right)$$
In practice we may choose the mean to be some function rather than 0, but knowing nothing else about the function $f(x)$, a mean of 0 is a reasonable choice.

Sampling the prior

So now we have a distribution that describes our function space. Remember, our reasoning here was to find a pretty general way to model arbitrary continuous functions. If we were to sample a point from this function space, what would the corresponding function look like?

The Gaussian part of Gaussian Processes refers to these wiggly curves. Note they are not shaped as Gaussians; we are not saying that the function has a Gaussian shape. We are saying that the points of the function are, a priori, one sample from an infinite-dimensional Gaussian distribution with a covariance described by the kernel. Some wiggly curves are more likely than others: note how all of these centre around 0, which comes from our choice of mean for the GP. The probability of the curves is Gaussian, and that's what gives them their arbitrary continuous (wiggly) shape.
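For intuition, here is a minimal sketch of how such curves could be drawn, reusing the hypothetical `rbf_kernel` helper above: discretize $x$ onto a fine grid, build the kernel matrix, and draw from a zero-mean multivariate Gaussian.

```python
import numpy as np

x_grid = np.linspace(0, 10, 200)                 # a fine grid standing in for "every point"
K = rbf_kernel(x_grid, x_grid, sigma=1.0, lengthscale=1.0)
K += 1e-8 * np.eye(len(x_grid))                  # small jitter for numerical stability

rng = np.random.default_rng(0)
# Each row is one draw from the prior: one wiggly curve evaluated on the grid.
prior_samples = rng.multivariate_normal(np.zeros(len(x_grid)), K, size=5)
```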

Consider the earlier example of Bayesian linear regression. That was in a sense Gaussian too, as we said our two parameters came from a joint normal distribution. Same thing here, except instead of two parameters, we consider the infinitely many points on the real number line.

Visualizing the prior

We can take the limit of many samples and overlay them to visualize the probability distribution of the prior at each point $x$.

Posterior

We start with a very useful theorem. If we have a Gaussian Process, then by definition the joint distribution of any two subsets of the GP is also Gaussian. And it turns out the conditional distribution of one Gaussian subset given the other is also Gaussian.

Specifically, if you consider two subsets of random variables of a Gaussian Process, $\mathbf{f}_1$ and $\mathbf{f}_2$, with joint distribution

$$\begin{bmatrix} \mathbf{f}_1 \\ \mathbf{f}_2 \end{bmatrix} \sim \mathcal{N}\left( \begin{bmatrix} \boldsymbol{\mu}_1 \\ \boldsymbol{\mu}_2 \end{bmatrix}, \begin{bmatrix} \Sigma_{11} & \Sigma_{12} \\ \Sigma_{21} & \Sigma_{22} \end{bmatrix} \right)$$

then

$$\mathbf{f}_2 \mid \mathbf{f}_1 \sim \mathcal{N}\left( \boldsymbol{\mu}_{2|1},\; \Sigma_{2|1} \right)$$

where

$$\boldsymbol{\mu}_{2|1} = \boldsymbol{\mu}_2 + \Sigma_{21} \Sigma_{11}^{-1} (\mathbf{f}_1 - \boldsymbol{\mu}_1), \qquad \Sigma_{2|1} = \Sigma_{22} - \Sigma_{21} \Sigma_{11}^{-1} \Sigma_{12}$$
So if we have observed data $(\mathbf{x}_1, \mathbf{f}_1)$, and want to calculate the posterior for prediction points $\mathbf{x}_2$, we have the prior:

$$\begin{bmatrix} \mathbf{f}_1 \\ \mathbf{f}_2 \end{bmatrix} \sim \mathcal{N}\left( \mathbf{0},\; \begin{bmatrix} K_{11} & K_{12} \\ K_{21} & K_{22} \end{bmatrix} \right)$$

where $K_{ij}$ is the matrix of kernel values between the points of $\mathbf{x}_i$ and the points of $\mathbf{x}_j$,

then we have the posterior:

$$\mathbf{f}_2 \mid \mathbf{f}_1 \sim \mathcal{N}\left( K_{21} K_{11}^{-1} \mathbf{f}_1,\; K_{22} - K_{21} K_{11}^{-1} K_{12} \right)$$

Note that $\mathbf{x}_2$ doesn't need to come from a small or even finite set; it can be the entire infinite-dimensional space, which is equivalent to predicting at every point on the number line.
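As a sketch (again building on the hypothetical `rbf_kernel` helper, and using `np.linalg.solve` instead of forming an explicit inverse), the posterior mean and covariance could be computed like this:

```python
import numpy as np

def gp_posterior(x1, f1, x2, sigma=1.0, lengthscale=1.0):
    """Posterior mean and covariance of f at prediction points x2,
    conditioned on noiseless observations (x1, f1), with a zero prior mean."""
    K11 = rbf_kernel(x1, x1, sigma, lengthscale) + 1e-8 * np.eye(len(x1))  # jitter
    K12 = rbf_kernel(x1, x2, sigma, lengthscale)
    K22 = rbf_kernel(x2, x2, sigma, lengthscale)

    mean = K12.T @ np.linalg.solve(K11, np.asarray(f1, dtype=float))  # K21 K11^{-1} f1
    cov = K22 - K12.T @ np.linalg.solve(K11, K12)                     # K22 - K21 K11^{-1} K12
    return mean, cov
```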

Sampling the posterior

Sampling a few times

Sampling many times

As we have a closed-form solution for the posterior (one of the nice things about GPs), we can visualize it without sampling; just specify the level of credibility you want to show.
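For example, a roughly 95% pointwise credible band comes straight from the posterior mean and the diagonal of the posterior covariance (a sketch reusing the hypothetical `gp_posterior` helper above; `x_obs` and `f_obs` are stand-in names for whatever data you have observed):

```python
mean, cov = gp_posterior(x_obs, f_obs, x_grid)       # x_obs, f_obs: observed data (assumed names)
std = np.sqrt(np.diag(cov))
lower, upper = mean - 1.96 * std, mean + 1.96 * std  # ~95% pointwise credible interval
```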

Noisy observations

Say instead we have $y = f(x) + \epsilon$, where the measurements have Gaussian error $\epsilon \sim \mathcal{N}(0, \sigma_n^2)$.

Then the prior covariance matrix of the observed data points is

$$K_{11} + \sigma_n^2 I$$

and we can just substitute this for $K_{11}$ in the earlier process.
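In the sketch above, that is a one-line change: add the noise variance (called `sigma_n` here, an assumed name) to the diagonal of $K_{11}$.

```python
K11 = rbf_kernel(x1, x1, sigma, lengthscale) + sigma_n**2 * np.eye(len(x1))
```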

Note that this gives the posterior distribution, i.e. the distribution for the noiseless function, not the predictive distribution for future measurements which will need to have noise added.

In the noiseless case we simply called this the posterior, but that was just because of the special case where $y = f(x)$ exactly, so the posterior for the function and the predictive distribution for measurements coincide.

Prior on the functions? Where are the parameters?

In the linear regression case, we defined a prior over the parameters, not over the functions that they create. Here, we set the prior on the functions directly:

$$f(x) \sim \mathcal{GP}\left(0,\; k(x, x')\right)$$
That is because Gaussian Process regression is a nonparametric method. This is a catch-all term for any model that doesn't rely on a fixed set of parameters to be optimized, and it includes histograms, decision trees, and some support vector machines. However, often when we use this term, what we are actually dealing with is a model with infinitely many parameters.

With a GP, the complexity of the model (i.e. its number of dimensions) increases with the number of data points. Contrast this with linear regression, where complexity increases with the number of explanatory variables. Scaling complexity with the number of data points has a huge advantage in low-data settings. Consider a linear regression, where the fit can be characterized just by its coefficients: no matter how many data points we have, in the one-dimensional case we compress all of the model's information down into an intercept and a slope. With GPs, however, we never compress away or lose information; in a sense, we are maximally leveraging the data we have available.

The flip side of this is that GPs are computationally constrained to low-data regimes. We have won statistical efficiency at the expense of computational efficiency. Practically speaking, we must invert the kernel matrix of the observed data, $K_{11}$, which is an expensive $O(n^3)$ operation, where $n$ is the number of data points. Compare this with a linear regression, where we must invert $X^\top X$, an $O(p^3)$ operation, where $p$ is the number of parameters. With GPs we are constrained to roughly 10,000 data points given modern tooling. However, this is fine for Bayesian Optimization, where we don't have so many data points. Also, we do not require this model to be computationally performant, as most time should be spent in function evaluation, i.e. performing experiments. It's ok for modelling to take a few minutes if the time between experiments is a few hours.

Hyperparameter selection

We have introduced 3 hyperparameters: the kernel has the terms $\sigma$ and $\ell$, and with noise we introduced $\sigma_n$. A common method used to pick these is to maximize the marginal likelihood of the data, which can be done through gradient methods.
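As a sketch of what that optimization might look like (not a tuned implementation; `x_obs`, `y_obs`, and the log-parameterization are my own assumptions), we can write the negative log marginal likelihood of the observed data under the GP-plus-noise prior and hand it to a general-purpose optimizer:

```python
import numpy as np
from scipy.optimize import minimize

def neg_log_marginal_likelihood(log_params, x, y):
    """Negative log marginal likelihood of y under a zero-mean GP prior
    with RBF kernel plus Gaussian observation noise."""
    sigma, lengthscale, sigma_n = np.exp(log_params)         # log-parameterization keeps values positive
    K = rbf_kernel(x, x, sigma, lengthscale) + sigma_n**2 * np.eye(len(x))
    L = np.linalg.cholesky(K + 1e-8 * np.eye(len(x)))
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))      # alpha = K^{-1} y via the Cholesky factor
    return (0.5 * y @ alpha                                  # 0.5 * y^T K^{-1} y
            + np.sum(np.log(np.diag(L)))                     # 0.5 * log det K
            + 0.5 * len(x) * np.log(2 * np.pi))

# x_obs, y_obs: the observed (noisy) data, assumed names.
result = minimize(neg_log_marginal_likelihood, x0=np.zeros(3), args=(x_obs, y_obs))
sigma, lengthscale, sigma_n = np.exp(result.x)
```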

However, there are arguments to be made in the context of Bayesian Optimization that we don't want to find optimal model hyperparameters, as we may want to err on the side of overestimating the kernel variance so we don't get stuck in local minima through the iterative process.

Summary

We want to find the combination of $\mathbf{x}$ values that gives us an optimal $y$ value, where $y = f(\mathbf{x})$. $f$ is a black box we can sample, but we cannot sample its derivatives. Sampling is expensive, so we want to minimize the number of samples we take.

The approach we take is to create a probabilistic (Bayesian) model of the function with the data we have, then find the right balance of exploration and exploitation to pick the next point to sample (maximizing the acquisition function).

We use a Gaussian Process for this probabilistic model, because it can model arbitrary continuous functions, and is powerful in low data regimes.

  1. Start with some initial sample measurements
  2. Fit a Gaussian Process to the data
  3. Compute the acquisition function based on the GP
  4. Find the $\mathbf{x}$ that maximizes the acquisition function
  5. Sample $f(\mathbf{x})$ and add the result to your data
  6. If time/resources are remaining, go to step 2
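Pulled together, the loop might look roughly like the sketch below. Everything here is illustrative: `sample_black_box` stands for your expensive experiment, `acquisition` for whichever acquisition function you choose, `budget` for your remaining resources, and `gp_posterior` is the hypothetical helper sketched earlier, assuming we are maximizing $f$ over a one-dimensional grid.

```python
import numpy as np

rng = np.random.default_rng(0)
x_candidates = np.linspace(0, 10, 500)         # dense grid standing in for the search space

# 1. Start with some initial sample measurements.
x_obs = rng.uniform(0, 10, size=5)
y_obs = np.array([sample_black_box(x) for x in x_obs])

for _ in range(budget):                        # 6. repeat while time/resources remain
    # 2. Fit a Gaussian Process to the data (hyperparameters could also be refit here).
    mean, cov = gp_posterior(x_obs, y_obs, x_candidates)
    std = np.sqrt(np.diag(cov))

    # 3 & 4. Compute the acquisition function on the GP and find the x that maximizes it.
    scores = acquisition(mean, std, best=y_obs.max())
    x_next = x_candidates[np.argmax(scores)]

    # 5. Sample f at x_next and add the result to the data.
    x_obs = np.append(x_obs, x_next)
    y_obs = np.append(y_obs, sample_black_box(x_next))
```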