# Advances in Kernel Methods

[edit]

### Room: PLB-DS05, Pam Liversidge Building

8:00- 9:00

**Arrivals**

9:00- 9:10

**Welcome**Neil Lawrence Amazon Research Cambridge

9:10- 9:50

**Infinite-Task Learning with Vector-Valued RKHSs**Florence d’Alché-Buc LTCI, Télécom ParisTech

9:50- 10:30

**Assembling stochastic quasi-Newton algorithms using GPs**Thomas Schön Uppsala University [slides]

10:30- 11:00

**Coffee Break**

11:00- 11:40

**Representation Learning with Kernel Methods**Jovana Mitrovic University of Oxford

11:40- 12:20

**Positive definite kernels for deterministic and stochastic approximations of (invariant) functions**David Ginsbourger Idiap and University of Bern [slides]

12:20- 13:30

**Lunch**

13:30- 14:10

**A Bootstrap Algebraic MultiGrid method for Spectral Clustering**Luisa Cutillo University of Leeds [slides]

14:10- 14:50

**Kernel Distances for Better Deep Generative Models**Dougal Sutherland Gatsby Computational Neuroscience Unit, University College London [slides]

14:50- 15:30

**Approximate Kernel Methods and Learning on Aggregates**Dino Sejdinovic University of Oxford [slides]

15:30- 16:00

**Tea Break**

16:00- 17:00

**Panel and Dicussion**

# Abstracts

**Infinite-Task Learning with Vector-Valued RKHSs**

#### Abstract

Machine learning has witnessed the tremendous success of solving tasks depending on a hyperparameter. While multi-task learning is celebrated for its capacity to solve jointly a finite number of tasks, learning a continuum of tasks for various loss functions is still a challenge. A promising approach, called Parametric Task Learning, has paved the way in the case of piecewise-linear loss functions. We propose a generic approach, called Infinite-Task Learning, to solve jointly a continuum of tasks via vector-valued RKHSs. We provide generalization guarantees to the suggested scheme and illustrate its efficiency in cost- sensitive classification, quantile regression and density level set estimation.

**Assembling stochastic quasi-Newton algorithms using GPs**

#### Abstract

In this talk I will focus on one of our recent developments where we show how the Gaussian process (GP) can be used to solve stochastic optimization problems. Our main motivation for studying these problems is that they arise when we are estimating unknown parameters in nonlinear state space models using sequential Monte Carlo (SMC). The very nature of this problem is such that we can only access the cost function (in this case the likelihood function) and its derivative via noisy observations, since there are no closed-form expressions available. Via SMC methods we can obtain unbiased estimates of the likelihood function. We start from the fact that many of the existing quasi-Newton algorithms can be formulated as learning algorithms, capable of learning local models of the cost functions. Inspired by this we can start assembling stochastic quasi-Newton-type algorithms, applicable in situations where we only have access to noisy observations of the cost function and its derivatives. We will show how we can make use of the GP model to learn the Hessian allowing for efficient solution of these stochastic optimization problems. Additional motivation for studying the stochastic optimization problem stems from the fact that it arise in almost all large-scale supervised machine learning problems, not least in deep learning. I will very briefly mention some ongoing work where we have removed the GP representation and scale our ideas to much higher dimensions (both in terms of the size of the dataset and the number of unknown parameters). If there is time towards the end I will also show a few additional GP constructions that we have been developing recently.

**Representation Learning with Kernel Methods**

#### Abstract

In this talk, I will discuss the problem of representation learning using kernel methods with particular focus on two domains -- likelihood-free inference and causal discovery. First, I will discuss the challenge of summary statistic construction within Approximate Bayesian Computation (ABC). Although the choice of summary statistics crucially influences the quality of the posterior sample in ABC, there are only few principled general-purpose approaches for the selection or construction of appropriate summary statistics. To address this, I will present our recent work on ABC with kernel-based distribution regression (DR-ABC) for automatically constructing informative problem-specific summary statistics. Second, I will discuss the data representation requirements in causal discovery. Due to ethical, technical and financial constraints, data from randomized control trials is often not available, and we have to resort to inferring causal relationships from observational data. As one potential approach to this, I will present our recent work on Kernel Conditional Deviance for Causal Inference (KCDC) which is a fully nonparametric causal discovery method based on purely observational data.

**Positive definite kernels for deterministic and stochastic approximations of (invariant) functions**

#### Abstract

Positive definite kernels have been used as a basic brick in function approximation, notably via the theory of reproducing kernel Hilbert spaces. In addition, they play a crucial role in Gaussian process modelling through the notion of covariance. Here we consider the problem of approximating functions known to be invariant (or degenerate) under specific classes of linear operators, and we present some implications in kernel methods. In particular, simulation and prediction examples are used to illustrate how GP models can incorporate a number of "structural priors" including group invariances, multivariate sparsity, or harmonicity as particular cases. Based on a series of joint works primarily with Xavier Bay, Laurent Carraro, Nicolas Durrande, Nicolas Lenz, Olivier Roustant and Dominic Schuhmacher.

**A Bootstrap Algebraic MultiGrid method for Spectral Clustering**

#### Abstract

Graph Laplacian is a popular tool for analyzing graphs, in particular in graph partitioning and clustering. Given a notion of similarity (via an adjacency matrix), graph clustering refers to identifying different groups such that vertices in the same group are more similar compared to vertices across different groups. Data clustering can be reformulated in terms of a graph partitioning problem when the given set of data is represented as a graph, also known as similarity graph. In this context, eigenvectors of the graph Laplacian are often used to obtain a new geometric representation of the original data set which generally enhances cluster properties and improves cluster detection. In this work, we apply a bootstrap Algebraic MultiGrid (AMG) method which constructs a set of vectors associated with the graph Laplacian. These vectors, referred to as algebraically smooth ones, span a low-dimensional Euclidean space which we use to represent the data enabling cluster detection both in synthetic and in realistic well-clustered graphs. We show that in the case of a good quality bootstrap AMG, the computed smooth vectors employed in the construction of the final AMG operator, which by construction is spectrally equivalent to the originally given graph Laplacian, accurately approximate the space in the lower portion of the spectrum of the preconditioned operator. Thus, our approach can be viewed as a spectral clustering technique associated with the generalized spectral problem (Laplace operator versus the final AMG operator), and hence it can be seen as an extension of the classical spectral clustering which employs a standard eigenvalue problem.

**Kernel Distances for Better Deep Generative Models**

#### Abstract

Generative adversarial networks have led to huge improvements in sample quality for image generation. But their success is hindered by both practical and theoretical problems, leading to the proposal of a huge number of alternative methods over the last few years. We study one class of alternatives, the MMD GAN, which uses a similar architecture to an original GAN but incorporates a partial closed-form optimization in Hilbert space. We deepen the understanding of these models, with a particular focus on the behavior of gradient penalties – inspired by the WGAN-GP and the more recent Sobolev GAN – in this context. Based on this, we propose a method to constrain the gradient analytically, rather than with an additive optimization penalty. MMD GANs with additive gradient penalties improve on the existing state-of-the-art WGAN-GP; our new method, the Scaled MMD GAN, does even better on unsupervised image generation on CelebA and ImageNet.

**Approximate Kernel Methods and Learning on Aggregates**

#### Abstract

While a typical supervised learning framework assumes that the inputs and the outputs are measured at the same levels of granularity, many applications only have access to outputs at a much coarser (aggregate) level. Kernel embeddings of distributions are well established as useful tools for fully nonparametric hypothesis testing, but they also found applications in learning and predicting on distributional inputs corresponding to such aggregate outputs. I will describe the use of large-scale approximations to kernel embeddings in the context of Bayesian approaches to learning on aggregates, as well as a variational approach using Gaussian processes and its application in fine-scale spatial modelling of disease.