8:30 - 9:00 Arrivals



9:00 - 9:30 Adaptive Bayesian Quadrature for Approximate Inference -- a Case Study (Alexandra Gessner, Max Planck Institute for Intelligent Systems)

Despite its algebraic beauty, Bayesian quadrature (BQ) has found little practical use in machine learning. This is due to its cubic computational cost in the number of design points, and the fact that GP-based BQ does not scale to high-dimensional tasks. So what is necessary to make Bayesian integration a usable tool for approximate inference? I will report on an ongoing study that specifically targets GP binary classification (Bayesian logistic regression). In this concrete application, a nonlinearly warped GP prior that gives rise to an adaptive BQ scheme, combined with careful algorithmic design, can make BQ practical even for high-dimensional integration problems.
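For orientation, here is a minimal sketch of plain, non-adaptive BQ in one dimension: a squared-exponential kernel paired with a standard-normal integration measure, for which the required kernel means are available in closed form. The integrand, lengthscale and fixed design below are illustrative assumptions; the adaptive, warped scheme of the talk builds on top of this basic construction.

```python
import numpy as np

# Minimal 1D Bayesian quadrature: estimate Z = E_{x ~ N(0,1)}[f(x)] under a
# GP(0, k) prior on f, with k the squared-exponential (SE) kernel. For this
# kernel/measure pair the required kernel means are available in closed form.

def se_kernel(a, b, ell=0.5):
    return np.exp(-0.5 * (a[:, None] - b[None, :]) ** 2 / ell**2)

def bq_estimate(x, fx, ell=0.5, jitter=1e-10):
    """Posterior mean and variance of Z given the design (x, f(x))."""
    K = se_kernel(x, x, ell) + jitter * np.eye(len(x))
    # z_i = int k(x, x_i) N(x; 0, 1) dx
    z = ell / np.sqrt(ell**2 + 1.0) * np.exp(-0.5 * x**2 / (ell**2 + 1.0))
    # zz = int int k(x, x') N(x; 0, 1) N(x'; 0, 1) dx dx'
    zz = ell / np.sqrt(ell**2 + 2.0)
    w = np.linalg.solve(K, z)        # BQ weights
    return w @ fx, zz - z @ w        # mean and variance of the integral

f = lambda x: np.exp(-x**2)          # illustrative integrand (assumption)
x = np.linspace(-3, 3, 15)           # fixed design; adaptive BQ would instead
mean, var = bq_estimate(x, f(x))     # place these points sequentially
print(mean, var)                     # exact value: 1/sqrt(3) ~ 0.5774
```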

9:45 - 10:15 A Unifying Framework for Sparse Gaussian Process Approximation using Power Expectation Propagation (Richard Turner, University of Cambridge) [slides]

The application of GPs is limited by computational and analytical intractabilities that arise when data are sufficiently numerous or when employing non-Gaussian models. A wealth of GP approximation schemes has been developed over the last 15 years to address these key limitations. Many of these schemes employ a small set of pseudo data points to summarise the actual data. We have developed a new pseudo-point approximation framework using Power Expectation Propagation (Power EP) that unifies a large number of these pseudo-point approximations. The new framework is built on standard methods for approximate inference (variational free-energy, EP and Power EP methods) rather than employing approximations to the probabilistic generative model itself. In this way, all of the approximation is performed at 'inference time' rather than at 'modelling time', resolving awkward philosophical and empirical questions that trouble previous approaches. Crucially, we demonstrate that the new framework includes new pseudo-point approximation methods that outperform current approaches on regression, classification and state-space modelling tasks in batch and online settings.
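In the regression case the unified objective takes a particularly clean form: a single power parameter alpha interpolates between the FITC marginal likelihood (alpha = 1) and Titsias' variational free energy (alpha -> 0). The numpy sketch below evaluates that interpolating objective with assumed SE-kernel hyperparameters and toy data; it is a rough illustration, not the paper's reference implementation.

```python
import numpy as np

def se(a, b, ell=1.0, sf2=1.0):
    return sf2 * np.exp(-0.5 * (a[:, None] - b[None, :]) ** 2 / ell**2)

def pep_energy(y, X, Z, alpha, sn2=0.1, jitter=1e-8):
    """Power EP approximate log marginal likelihood, sparse GP regression.
    alpha = 1 recovers FITC; alpha -> 0 recovers the variational (VFE) bound."""
    Kuu = se(Z, Z) + jitter * np.eye(len(Z))
    Kfu = se(X, Z)
    Qff = Kfu @ np.linalg.solve(Kuu, Kfu.T)           # Nystrom approximation
    d = np.clip(1.0 - np.diag(Qff), 0.0, None)        # diag(Kff - Qff), sf2 = 1
    S = Qff + np.diag(alpha * d) + sn2 * np.eye(len(X))
    _, logdet = np.linalg.slogdet(S)
    quad = y @ np.linalg.solve(S, y)
    gauss = -0.5 * (len(X) * np.log(2 * np.pi) + logdet + quad)
    # Correction term: vanishes at alpha = 1 (FITC) and tends to the familiar
    # -sum(d) / (2 sn2) trace term of the variational bound as alpha -> 0.
    extra = -(1 - alpha) / (2 * alpha) * np.sum(np.log1p(alpha * d / sn2))
    return gauss + extra

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, 50)
y = np.sin(X) + 0.1 * rng.standard_normal(50)
Z = np.linspace(-3, 3, 8)                             # pseudo-points
for alpha in (1.0, 0.5, 1e-4):                        # FITC ... ~VFE
    print(alpha, pep_energy(y, X, Z, alpha))
```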

10:30 - 11:00 Coffee



11:00 - 11:30 Advances in using GPs with derivative observations (Eero Siivola, Aalto University) [slides]

Derivative observations are a powerful and easy way to add information to GPs, yet they are used less often than they could be. Our recent research has found several ways in which derivatives can make GPs even more powerful, even when the derivatives themselves cannot be observed. In this talk I will discuss how derivative information can be used efficiently to detect monotonic functions, how it can help physicists find minimum-energy paths, and how it can reduce the number of acquisitions in Bayesian optimization. Finally, I will briefly discuss how advances in EP methods allow us to scale these techniques to even larger data sets.
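To make the mechanics concrete: for the squared-exponential kernel, the cross-covariances between f and f' are simply kernel derivatives, so derivative observations enter GP regression through a larger joint covariance matrix. A minimal 1D sketch with illustrative data and hyperparameters (the monotonicity, minimum-energy-path and BO applications in the talk build further machinery on top of this):

```python
import numpy as np

# Joint GP over f and f' for the SE kernel k(x, x') = exp(-(x-x')^2 / (2 l^2)):
#   cov(f(x),  f(x'))  = k
#   cov(f(x),  f'(x')) = k (x - x') / l^2
#   cov(f'(x), f'(x')) = k (1/l^2 - (x - x')^2 / l^4)

def blocks(xa, xb, ell=1.0):
    r = xa[:, None] - xb[None, :]
    k = np.exp(-0.5 * r**2 / ell**2)
    return k, k * r / ell**2, k * (1.0 / ell**2 - r**2 / ell**4)

def posterior_mean(xs, xf, f, xd, df, ell=1.0, noise=1e-6):
    """GP posterior mean at xs, given values f at xf and derivatives df at xd."""
    Kvv = blocks(xf, xf, ell)[0]
    Kvd = blocks(xf, xd, ell)[1]
    Kdd = blocks(xd, xd, ell)[2]
    K = np.block([[Kvv, Kvd], [Kvd.T, Kdd]]) + noise * np.eye(len(xf) + len(xd))
    Ks = np.hstack([blocks(xs, xf, ell)[0],    # cov(f(xs), f(xf))
                    blocks(xs, xd, ell)[1]])   # cov(f(xs), f'(xd))
    return Ks @ np.linalg.solve(K, np.concatenate([f, df]))

xf = np.array([-2.0, 0.0, 2.0]); fv = np.sin(xf)   # value observations (toy)
xd = np.array([-1.0, 1.0]);      dv = np.cos(xd)   # derivative observations
print(posterior_mean(np.linspace(-3, 3, 7), xf, fv, xd, dv))
```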

11:45 - 12:15 Latent Gaussian Process Regression (Carl Henrik Ek, University of Bristol)

An attractive property of Gaussian processes is that, through very simple means, it is possible to formulate priors that are both interpretable and expressive. An example is the squared-exponential covariance function, which encodes a global smoothness structure with a single parameter. For many types of data, however, such global assumptions are not valid. When moving from modelling stationary to non-stationary covariances, the prior assumption changes from a global structure to input-dependent, local structures. In this talk I will show how one can learn such functions by extending the input space with a latent variable. To learn this model we need to marginalise over the latent space, which is analytically intractable. We will show how one can combine the traditional variational learning of GP-LVMs with a simple sampling-based approach to learn models capable of capturing non-stationary and multi-modal functions.
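As a toy illustration of the augmentation itself (not the talk's variational-plus-sampling inference), the sketch below concatenates a latent coordinate w to each input and places a stationary SE kernel on the augmented space; once w varies across data points, the induced model over x alone becomes non-stationary and multi-modal. All inputs and hyperparameters below are made up.

```python
import numpy as np

# Stationary SE kernel on the augmented input (x, w). When different data
# points carry different latent w, the induced model over x alone becomes
# non-stationary / multi-modal, even though the augmented kernel is stationary.
def k_aug(x, w, x2, w2, ell_x=0.5, ell_w=0.5):
    dx = x[:, None] - x2[None, :]
    dw = w[:, None] - w2[None, :]
    return np.exp(-0.5 * dx**2 / ell_x**2 - 0.5 * dw**2 / ell_w**2)

rng = np.random.default_rng(1)
x = np.linspace(0, 1, 100)

# Two latent "branches", w = 0 and w = 1: a single draw from the joint prior
# yields two distinct functions over the same x, i.e. a multi-modal p(f | x).
X = np.concatenate([x, x])
W = np.concatenate([np.zeros_like(x), np.ones_like(x)])
K = k_aug(X, W, X, W) + 1e-6 * np.eye(len(X))
f = np.linalg.cholesky(K) @ rng.standard_normal(len(X))
branch0, branch1 = f[:100], f[100:]
print(np.abs(branch0 - branch1).mean())   # > 0: the two modes differ
```

In the model of the talk, w is of course unobserved, and the marginalisation over it is exactly what the combined variational GP-LVM and sampling scheme addresses.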

12:30 - 13:30 Lunch



13:30 - 14:00 Stochastic (Partial) Differential Equation Methods for Gaussian Processes (Simo Särkkä, Aalto University)

Stochastic partial differential equations and stochastic differential equations can be seen as alternatives to kernels for representing Gaussian processes in machine learning and inverse problems. Linear operator equations correspond to spatial kernels, and temporal kernels are equivalent to linear Itô stochastic differential equations. The differential-equation representations allow numerical methods for differential equations to be applied to Gaussian processes: for example, finite differences, finite elements, basis-function methods, and Galerkin methods. In the temporal and spatio-temporal cases we can use linear-time Kalman filter and smoother approaches.
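As a concrete instance of the temporal case: the Matérn 3/2 kernel corresponds to a two-state linear SDE, and GP regression then reduces to a Kalman filter that runs in time linear in the number of observations. A rough sketch of this standard construction, with assumed hyperparameters and toy data (the smoothing pass is omitted):

```python
import numpy as np
from scipy.linalg import expm

# Matern-3/2 GP as a two-state linear SDE: dx = F x dt + L dW, f(t) = H x(t).
# Kalman filtering on the discretized model does GP regression in O(T) time.
def matern32_ss(ell=1.0, sigma2=1.0):
    lam = np.sqrt(3.0) / ell
    F = np.array([[0.0, 1.0], [-lam**2, -2.0 * lam]])
    Pinf = np.diag([sigma2, lam**2 * sigma2])   # stationary state covariance
    H = np.array([[1.0, 0.0]])
    return F, Pinf, H

def kalman_gp(t, y, noise=0.1, ell=1.0, sigma2=1.0):
    """Filtered posterior mean of f at the observation times t."""
    F, Pinf, H = matern32_ss(ell, sigma2)
    m, P = np.zeros(2), Pinf.copy()
    means, t_prev = [], t[0]
    for tk, yk in zip(t, y):
        A = expm(F * (tk - t_prev))             # exact discretization
        Q = Pinf - A @ Pinf @ A.T
        m, P = A @ m, A @ P @ A.T + Q           # predict
        S = H @ P @ H.T + noise                 # update
        K = P @ H.T / S
        m = m + (K * (yk - H @ m)).ravel()
        P = P - K @ H @ P
        means.append(m[0])
        t_prev = tk
    return np.array(means)                      # (smoothing pass omitted)

t = np.linspace(0, 10, 200)
y = np.sin(t) + 0.3 * np.random.default_rng(2).standard_normal(t.size)
print(kalman_gp(t, y)[:5])
```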

14:15 - 14:45 Efficient and principled score estimation (Heiko Strathmann, Gatsby Unit, UCL) [slides]

We introduce and analyse an algorithm to efficiently estimate a dataset's score function, i.e. the derivative of the log-density. Our work builds on a recently proposed score-matching estimator for an infinite-dimensional exponential family model in a reproducing kernel Hilbert space. To overcome the estimator's prohibitive computational cost, cubic in both the dimension D and the number of samples N, we apply the Nyström method: by representing the solution to the estimation problem using only a sub-sample of the data, we significantly reduce run-time and memory usage. We present initial and promising work towards both consistency of the approximate estimator and a generalisation error analysis for the random design setting, using ideas from recent theoretical breakthroughs for Nyström kernel least-squares. We compare our method to the popular de-noising autoencoder and to previous approximations of the kernel model. In addition to the lack of theoretical analysis for the autoencoder methods, an empirical comparison shows that our estimator performs favourably: it is more data-efficient, has fewer parameters which can be tuned in a principled way, and behaves well outside the range of the training data.
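The paper's estimator lives in an infinite-dimensional RKHS, but the core recipe (score matching with a solution represented through a sub-sample of the data) can be illustrated with a finite-dimensional cousin: a score model that is linear in RBF features centred on m subsampled points. The sketch below is that simpler stand-in with made-up hyperparameters, not the kernel exponential family estimator itself.

```python
import numpy as np

# 1D score matching with a linear-in-parameters score model
# s(x) = phi(x)^T alpha, where each phi_i is an RBF centred on one of m
# points subsampled from the data. Hyvarinen's objective
#   J = E[ s(x)^2 / 2 + s'(x) ]
# is quadratic in alpha, so the estimator is a single ridge-regularised solve.

def feats(x, z, ell=0.5):
    d = x[:, None] - z[None, :]
    phi = np.exp(-0.5 * d**2 / ell**2)
    return phi, -d / ell**2 * phi            # features and their x-derivatives

def fit_score(x, m=30, ell=0.5, lam=1e-3, seed=0):
    z = np.random.default_rng(seed).choice(x, size=m, replace=False)
    phi, dphi = feats(x, z, ell)
    G = phi.T @ phi / len(x)                 # E[phi phi^T]
    b = dphi.mean(axis=0)                    # E[phi']
    alpha = -np.linalg.solve(G + lam * np.eye(m), b)
    return lambda xs: feats(xs, z, ell)[0] @ alpha

x = np.random.default_rng(3).standard_normal(2000)   # toy data ~ N(0, 1)
s = fit_score(x)
xs = np.linspace(-2, 2, 5)
print(s(xs))    # should approximate the true score of N(0,1), which is -x
print(-xs)
```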

15:00 - 15:30 Tea



15:30 - 16:00 Bayesian Optimization with Tree-structured Dependencies (Javier Gonzalez, Amazon)

Bayesian optimization has been successfully used to optimize complex black-box functions whose evaluations are expensive. In many applications, such as deep learning and predictive analytics, the optimization domain is itself complex and structured. In this work, we focus on use cases where this domain exhibits a known dependency structure. The benefit of leveraging this structure is twofold: we explore the search space more efficiently, and posterior inference scales more favorably with the number of observations than in the Gaussian-process-based approaches published in the literature. We introduce a novel surrogate model for Bayesian optimization that combines independent Gaussian processes with a linear model encoding a tree-based dependency structure, which can transfer information between overlapping decision sequences. We also design a specialized two-step acquisition function that explores the search space more effectively. Our experiments on synthetic tree-structured functions and on the tuning of feedforward neural networks trained on a range of binary classification datasets show that our method compares favorably with competing approaches.
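A hedged toy version of the structured-domain setting: a root decision selects one of two branches, each owning its own continuous variable and its own independent GP surrogate, and a myopic EI acquisition first scores candidates within each branch and then picks the branch. This illustrates the setting only; the paper's linear model for sharing information across overlapping decision sequences and its two-step acquisition are not reproduced here, and the objectives below are made up.

```python
import numpy as np
from scipy.stats import norm

def gp_post(xtr, ytr, xte, ell=0.3, noise=1e-4):
    """Posterior mean/std of a zero-mean SE-kernel GP (1D, minimal)."""
    k = lambda a, b: np.exp(-0.5 * (a[:, None] - b[None, :]) ** 2 / ell**2)
    K = k(xtr, xtr) + noise * np.eye(len(xtr))
    Ks = k(xte, xtr)
    mu = Ks @ np.linalg.solve(K, ytr)
    var = 1.0 - np.sum(Ks * np.linalg.solve(K, Ks.T).T, axis=1)
    return mu, np.sqrt(np.clip(var, 1e-12, None))

def ei(mu, sd, best):                         # expected improvement (minimise)
    z = (best - mu) / sd
    return (best - mu) * norm.cdf(z) + sd * norm.pdf(z)

# Toy tree: a root decision picks branch "a" or "b"; each branch owns its own
# continuous variable on [0, 1] and gets its own independent GP surrogate.
objective = {"a": lambda x: (x - 0.2) ** 2,          # made-up objectives
             "b": lambda x: 0.1 + (x - 0.8) ** 2}
data = {br: ([0.5], [objective[br](0.5)]) for br in objective}

cand = np.linspace(0, 1, 101)
for _ in range(10):
    best = min(min(ys) for _, ys in data.values())
    scores = {}
    for br, (xs, ys) in data.items():                # EI within each branch,
        mu, sd = gp_post(np.array(xs), np.array(ys), cand)
        e = ei(mu, sd, best)
        scores[br] = (e.max(), cand[e.argmax()])
    br = max(scores, key=lambda b: scores[b][0])     # then pick the branch
    x_next = scores[br][1]
    data[br][0].append(x_next)
    data[br][1].append(objective[br](x_next))

print({br: min(ys) for br, (_, ys) in data.items()})  # branch "a" should win
```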