8:30- 8:45 Arrivals



8:45- 9:00 Welcome Ralf Herbrich, Amazon



9:00- 9:30 Random matrix theory and free probability for Gaussian latent variable models Manfred Opper, TU Berlin

Inference algorithms for Gaussian latent variable models, such as Expectation Propagation and variational Gaussian approximations, require frequent matrix inversions, which make applications to large systems nontrivial. On the other hand, there are approaches to approximate inference which seem to take advantage of the fact that the number of random variables in the model is large. One example is the AMP (Approximate Message Passing) family of algorithms, which has been applied to compressed sensing. A crucial assumption is that certain large matrices in the problem can be considered as random. In this talk I will discuss basic results of the “free probability” approach to random matrix theory and its potential to simplify matrix operations for approximate inference.

9:45- 10:15 Scalable Multi-Class Gaussian Process Classification using Expectation Propagation Daniel Hernández-Lobato, Universidad Autónoma de Madrid [slides]

We describe an expectation propagation (EP) method for multi-class classification with Gaussian processes that scales well to very large datasets. In this method the estimate of the log-marginal-likelihood involves a sum across the data instances. This enables efficient training using stochastic gradients and mini-batches. When this type of training is used, the computational cost does not depend on the number of data instances N. Furthermore, extra assumptions in the approximate inference process make the memory cost independent of N. The consequence is that the proposed EP method can be used on datasets with millions of instances. We empirically compare this method with alternative approaches that approximate the required computations using variational inference. The results show that it performs similarly to or even better than these techniques, which sometimes give significantly worse predictive distributions in terms of the test log-likelihood. In addition, the training process of the proposed approach seems to converge in fewer iterations.
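The role of the sum across data instances can be illustrated with a minimal sketch. It is not the EP method from the talk: the per-instance terms below are hypothetical placeholder numbers, and the function name is an assumption made for illustration. The sketch only shows that rescaling a mini-batch sum by N / batch size gives an unbiased estimate of the full sum, at a cost that depends on the batch size rather than on N.

```python
import random


def minibatch_sum_estimate(per_instance_terms, batch_size, rng):
    """Unbiased estimate of sum(per_instance_terms) from a uniform mini-batch."""
    n = len(per_instance_terms)
    # Draw batch_size distinct indices uniformly at random.
    batch = rng.sample(range(n), batch_size)
    # Rescale so that the expectation equals the full sum over all n terms.
    return (n / batch_size) * sum(per_instance_terms[i] for i in batch)


rng = random.Random(0)
# Hypothetical per-instance contributions to an objective (placeholder values).
terms = [-0.5 * (i % 7) for i in range(1000)]
exact = sum(terms)

# Averaging many mini-batch estimates should approach the exact sum.
estimates = [minibatch_sum_estimate(terms, 50, rng) for _ in range(2000)]
mean_estimate = sum(estimates) / len(estimates)
```

Each call touches only 50 of the 1000 terms, which is why the per-step cost of this kind of training does not grow with the dataset size.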

10:30- 11:00 Coffee



11:00- 11:30 Doubly Stochastic Variational Inference for Deep Gaussian Processes Hugh Salimbeni, Imperial College London [slides]

Deep Gaussian Processes (DGPs) provide a Bayesian non-parametric alternative to traditional deep networks. A variational objective can be derived in closed form if the variational posterior is forced to factorize between and within layers, but this severe independence assumption does not work well in practice and does not readily scale to large data. We present a doubly stochastic variational inference algorithm that does not force independence between layers. The first source of stochasticity is Monte Carlo sampling of the lower bound. This allows us to use a rich posterior that matches the structure of the model. The second source of stochasticity is minibatch sub-sampling, permitting inference on very large data. With our approach we show that DGPs outperform shallow models on a wide range of benchmark classification and regression tasks, ranging in size from hundreds of data points to tens of millions.
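The two sources of stochasticity can be sketched on a toy objective. This is not the DGP lower bound: the integrand -(x - z)^2, the single Gaussian latent variable z, and all names below are assumptions chosen so that the exact value is known in closed form. The estimator combines Monte Carlo samples of z (drawn as mu + sigma * eps) with a rescaled mini-batch over the data.

```python
import random


def doubly_stochastic_estimate(xs, mu, sigma, batch_size, num_samples, rng):
    """Estimate sum_i E_{z ~ N(mu, sigma^2)}[-(x_i - z)^2] with two noise sources."""
    n = len(xs)
    batch = rng.sample(xs, batch_size)  # source 2: mini-batch sub-sampling
    total = 0.0
    for x in batch:
        mc = 0.0
        for _ in range(num_samples):
            # Source 1: Monte Carlo sample of the latent variable.
            z = mu + sigma * rng.gauss(0.0, 1.0)
            mc += -(x - z) ** 2
        total += mc / num_samples
    # Rescale the mini-batch sum to the full dataset size.
    return (n / batch_size) * total


rng = random.Random(0)
xs = [i / 100 for i in range(200)]  # toy inputs
mu, sigma = 1.0, 0.5

# Closed form for this toy case: E[-(x - z)^2] = -((x - mu)^2 + sigma^2).
exact = sum(-((x - mu) ** 2 + sigma ** 2) for x in xs)

estimates = [doubly_stochastic_estimate(xs, mu, sigma, 20, 1, rng) for _ in range(4000)]
mean_estimate = sum(estimates) / len(estimates)
```

Even with a single Monte Carlo sample per data point, the estimate is unbiased, so its average over many draws approaches the exact value.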

11:45- 12:15 Parallel and Distributed Thompson Sampling for Large-scale Accelerated Exploration of Chemical Space José Miguel Hernández Lobato, University of Cambridge [slides]

Chemical space is so large that brute force searches for new interesting molecules are infeasible. High-throughput virtual screening can speed up the discovery process by collecting very large amounts of data in parallel, e.g., up to hundreds or thousands of parallel measurements. Bayesian optimization (BO) can produce additional acceleration by sequentially identifying the most useful simulations or experiments to be performed next. However, current BO methods cannot scale to the large numbers of parallel measurements and the massive libraries of molecules currently used in high-throughput screening. Here, we propose a scalable solution based on a parallel and distributed implementation of Thompson sampling (PDTS). We show that, in small scale problems, PDTS performs similarly to parallel expected improvement (EI), a batch version of the most widely used BO heuristic. Additionally, in settings where parallel EI does not scale, PDTS outperforms other scalable baselines such as a greedy search, ε-greedy approaches and a random search method. These results show that PDTS is a successful solution for large-scale parallel BO.
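The idea of selecting a batch with Thompson sampling can be sketched as follows. This is a simplified illustration, not the PDTS implementation: it assumes a hypothetical independent Gaussian posterior per candidate (given as `post_mean` and `post_std`) in place of a real predictive model, runs the workers in a sequential loop, and excludes candidates already picked by earlier workers. The function name and parameters are illustrative.

```python
import random


def parallel_thompson_batch(post_mean, post_std, evaluated, num_workers, rng):
    """Pick one unevaluated candidate per worker, each from its own posterior draw."""
    batch = []
    for _ in range(num_workers):
        # Assumption: skip candidates already measured or already in this batch.
        taken = evaluated | set(batch)
        best, best_value = None, float("-inf")
        for i in range(len(post_mean)):
            if i in taken:
                continue
            # Each worker draws its own sample of the objective at candidate i.
            draw = rng.gauss(post_mean[i], post_std[i])
            if draw > best_value:
                best, best_value = i, draw
        if best is not None:
            batch.append(best)
    return batch


rng = random.Random(1)
means = [0.1, 0.9, 0.5, 0.7, 0.3]  # hypothetical posterior means
stds = [0.2, 0.2, 0.2, 0.2, 0.2]   # hypothetical posterior standard deviations
batch = parallel_thompson_batch(means, stds, {0}, 3, rng)
```

Because every worker acts on a different posterior sample, the batch is diversified by the randomness of the draws themselves, without the joint batch computation that parallel EI requires.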

12:30- 13:30 Lunch



13:30- 14:00 Variational Fourier Features for Gaussian Processes James Hensman, prowler.io [slides]

This work brings together two powerful concepts in Gaussian processes: the variational approach to sparse approximation and the spectral representation of Gaussian processes. This gives rise to an approximation that inherits the benefits of the variational approach but with the representational power and computational scalability of spectral representations. The work hinges on a key result that there exist spectral features related to a finite domain of the Gaussian process which exhibit almost-independent covariances. We derive these expressions for Matern kernels in one dimension, and generalize to more dimensions using kernels with specific structures. Under the assumption of additive Gaussian noise, our method requires only a single pass through the dataset, making for very fast and accurate computation. We fit a model to 4 million training points in just a few minutes on a standard laptop. With non-conjugate likelihoods, our MCMC scheme reduces the cost of computation from O(NM^2) (for a sparse Gaussian process) to O(NM) per iteration, where N is the number of data points and M is the number of features.

14:15- 14:45 Projection predictive model reduction for Gaussian process models Aki Vehtari, Aalto University [slides]

We propose a new method for simplification of Gaussian process (GP) models by projecting the information contained in the full encompassing model and selecting a reduced number of variables based on their predictive relevance. Our results on synthetic and real-world datasets show that the proposed method improves the assessment of variable relevance compared to automatic relevance determination (ARD) via the length-scale parameters. We expect the method to be useful for improving the explainability of the models, reducing future measurement costs and reducing the computation time for making new predictions.

15:00- 15:30 Tea



15:30- 16:00 Improved Differential Privacy using Inducing Variables Michael T. Smith, University of Sheffield [slides]

A major challenge for machine learning is increasing the availability of data while respecting the privacy of individuals. Here we combine the provable privacy guarantees of the Differential Privacy framework with the flexibility of Gaussian processes (GPs). We propose a method using GPs to provide Differentially Private (DP) regression. We then show that using inducing inputs allows us to reduce the scale of the added perturbation. We find that we are able to provide practical prediction accuracy, while still providing privacy guarantees for regression over multi-dimensional inputs. Together these methods provide a starter toolkit for combining differential privacy and GPs.

16:15- 16:45 Efficient Modeling of Latent Information in Supervised Learning using Gaussian Processes Zhenwen Dai, Amazon [slides]

Often in machine learning, data are collected as a combination of multiple conditions, e.g., the voice recordings of multiple persons, each labeled with an ID. How could we build a model that captures the latent information related to these conditions and generalizes to a new one with few data? We present a new model called Latent Variable Multiple Output Gaussian Processes (LVMOGP) that jointly models multiple conditions for regression and generalizes to a new condition with a few data points at test time. LVMOGP infers the posteriors of Gaussian processes together with a latent space representing the information about different conditions. We derive an efficient variational inference method for LVMOGP, whose computational complexity is as low as that of sparse Gaussian processes. We show that LVMOGP significantly outperforms related Gaussian process methods on various tasks with both synthetic and real data.

20:30 Workshop dinner