GPIP Workshop: First Discussion Session


Phil Torr: What do you do when the covariance matrix is rank deficient?

Carl: Mathematically a covariance function is positive definite, but numerically it may have eigenvalues close to zero; the problem arises in noise-free situations. The usual trick is to add 10^-6 to the diagonal of the covariance matrix, which implicitly adds noise with standard deviation 10^-3. In practice this seems to work.
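(A minimal sketch of the jitter trick Carl describes, in NumPy; the RBF kernel and sizes used here are illustrative assumptions, not something discussed in the session.)

    import numpy as np

    def rbf_kernel(X, lengthscale=1.0):
        # Squared-exponential covariance; illustrative choice of kernel.
        d2 = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
        return np.exp(-0.5 * d2 / lengthscale ** 2)

    X = np.random.randn(50, 1)
    K = rbf_kernel(X)

    # In the noise-free case the Cholesky factorisation can fail because some
    # eigenvalues are numerically ~0; adding 1e-6 to the diagonal (implicitly,
    # noise with standard deviation 1e-3) usually fixes this.
    jitter = 1e-6
    L = np.linalg.cholesky(K + jitter * np.eye(len(X)))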

Chris: Said that if it is genuinely a degenerate Gaussian process, then maybe you should be working in the weight-space viewpoint.

Yaakov: suggested using one of the sparse approaches (which are implicitly low rank) with 'inducing variables', increasing their number until you hit the right rank.

Carl: Mentioned that if you repeat an input then you get a zero eigenvalue straight away; however, in this context there must be no noise.
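(A small illustration of this point, assuming a noise-free squared-exponential kernel; purely a sketch.)

    import numpy as np

    X = np.array([[0.0], [1.0], [1.0]])        # the input 1.0 appears twice
    d2 = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
    K = np.exp(-0.5 * d2)                      # noise-free covariance matrix

    # Two identical rows/columns make K singular: the smallest eigenvalue
    # is zero (up to floating-point round-off).
    print(np.linalg.eigvalsh(K))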

Anton: Mentioned that problems arose when computing with EP and optimising hyperparameters.

Carl said that if the noise goes to small values the ratio of the largest to smallest eigenvalues (the condition number) gets too big, and this is the key problem.

Phil: asked whether having two identical points is the only way of getting a degenerate kernel.

Dan said that in spatial statistics it is well known that two points close together lead to difficulties and you have to add noise.

Yaakov: wanted to know, if he is doing regression where the true noise covariance is a but his GP has noise variance b, whether anything can be said about the convergence of the Gaussian process to the data.

Chris's naïve thought was that you can use the equivalent kernel analysis: you can obtain an analytic form for the weighting function, which depends on the noise level, and this becomes close to a delta function as the amount of data increases. Given this analysis Chris expects it would converge, but with a sub-optimal rate of convergence.

Yaakov said that he was actually interested in noise with temporal structure, so you have a full covariance for representing it.

David said ... so you are splitting the situation into signal and noise in the kernel functions ... and asked what happens if you have the colour spectrum of the noise covariance wrong. He thought you could have the same normal bland convergence analysis as those you get for neural networks, but the rate of convergence towards the truth can be very bad.

Yaakov gave an example: if you exchanged the noise and signal kernels, then it would be different.

David said the bland result depends on the family of kernels you try. If the underlying basis functions are infinite and smooth ... then it will work, and by switching the noise and the signal, you are not using smooth kernel functions.

David said he thought that the convergence of the system was not the important practical point, but Yaakov wanted the reassurance that it would converge.

Dan pointed out that if the model was wrong then the inference would be wrong and it was important to specify the correct covariance.

Yaakov felt it was an important issue for when the model is mis-specified.

Roger felt you could do this through an eigendecomposition of the kernel.

David felt that the more important point was that in general the noise would typically be non-Gaussian and we should worry more about that point.

Yaakov: saw this as an orthogonal issue.

Carl: felt that you couldn't answer this question because the model could be very wrong.

Ed said that he felt Peter Sollich had computed some of these rates, but Chris thought that these were for misspecified kernels, not noise models.

Andrew Stoddart asked what the big problems facing the community are at the moment.

David started a slide show.

He had a series of challenges:

Efficient Methods

He thought that a challenge would be to sample one function from a prior with many data points and get the inference for that down to N^2.

By sharing independent sets of samples from the prior (12 independent N-component samples), can we learn more?
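(For reference, the straightforward way to draw samples from a GP prior uses a Cholesky factorisation and costs O(N^3); the challenge is to do better. The sketch below, with an illustrative RBF kernel and N = 2000, is only meant to show where the cost sits.)

    import numpy as np

    def rbf_kernel(X, lengthscale=1.0):
        d2 = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
        return np.exp(-0.5 * d2 / lengthscale ** 2)

    N = 2000
    X = np.linspace(0.0, 10.0, N)[:, None]
    K = rbf_kernel(X) + 1e-6 * np.eye(N)       # jitter for numerical stability

    L = np.linalg.cholesky(K)                  # O(N^3): the bottleneck
    f = L @ np.random.randn(N)                 # one sample from the prior
    samples = L @ np.random.randn(N, 12)       # 12 independent N-component samples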

Manfred wondered why this was a sensible thing to do; I missed the answer.

Roger felt that it would be sensible to look at stationary kernels as a first case. Chris said that you could do this on grids for spatial kernels.

Manfred asked if it depended on the input dimensionality; Chris said it would, and David said we should agree on the dimension.
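(As a sketch of why grids help: for a stationary kernel on a regular 1-D grid the covariance matrix is Toeplitz, and matrix-vector products can be done in O(N log N) by circulant embedding and the FFT. The kernel and sizes below are illustrative assumptions.)

    import numpy as np

    N = 512
    x = np.linspace(0.0, 10.0, N)                      # regular 1-D grid
    K = np.exp(-0.5 * (x[:, None] - x[None, :]) ** 2)  # stationary RBF => Toeplitz

    # Embed the first row in a circulant vector and multiply via the FFT,
    # giving K @ v in O(N log N) instead of O(N^2).
    first_row = K[0]
    c = np.concatenate([first_row, first_row[-2:0:-1]])
    v = np.random.randn(N)
    v_padded = np.concatenate([v, np.zeros(N - 2)])
    Kv = np.fft.irfft(np.fft.rfft(c) * np.fft.rfft(v_padded))[:N]

    assert np.allclose(K @ v, Kv)                      # matches the direct product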

Anton mentioned that there was often a problem of multi-modality. David had said that this wasn't a big issue, and Anton wanted to know how you resolve the problem.

David said that a lot of it is fiddling with priors ... thinking carefully about what the length scales in an ARD kernel should be. If the length scales could be really short in multiple dimensions then it would be impossible to predict anything, so he felt that length scales should be longer. David's experience was that on real problems there weren't big problems.
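(For concreteness, an ARD squared-exponential kernel with one length scale per input dimension, so the priors David mentions act on the entries of lengthscales below; the parameterisation is an illustrative assumption.)

    import numpy as np

    def ard_rbf_kernel(X1, X2, lengthscales, variance=1.0):
        # One length scale per input dimension: a very short length scale lets
        # the function vary rapidly in that dimension, a long one effectively
        # switches the dimension off.
        Z1 = X1 / lengthscales
        Z2 = X2 / lengthscales
        d2 = np.sum((Z1[:, None, :] - Z2[None, :, :]) ** 2, axis=-1)
        return variance * np.exp(-0.5 * d2)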

Carl said that multi-modality was a good thing for complicated data sets, and that there might be many sensible explanations for a data set.

Pashmeet said that inducing points lead to multi-modality, and that there is a trade-off in how increasing the number of inducing points affects multi-modality.

David said that there may be an analogy with neural networks, that when you increased the number of nodes in a neural network the modes became interconnected, and Hamiltonian (hybrid) Monte-Carlo would resolve the issue.

Dan said that if you use more complicated models in the covariance function then this should be on the basis of understanding, and there should be sensible priors for these models. If you can't say anything about the parameters you are adding, you shouldn't be adding them.

Rod asked about multiple-output regression: how do you deal with multiple correlated outputs?

Dan said that in atmospheric science you could construct kernels which incorporate that knowledge, and the correlations between outputs emerge naturally. But he felt that this was a difficult challenge when you didn't have this information.

Guido wanted to know about the eigenfunction approach ... certain kernel, certain probability on the inputs ... can the eigenfunctions give you insight into the model that you choose ... I believe the eigenfunctions should be of this form, so what should the kernel be.

Chris said that certainly it was true that if you had different families of covariance functions you could fit them to the data. If you had a lot of draws from something that you wanted to model, you could build a covariance matrix of those draws and compute the eigenfunctions of that; then you could find out about issues such as non-stationarity from the data. He felt that atmospheric geoscientists look at this sort of analysis.
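(A sketch of the kind of analysis Chris describes, assuming many draws of the function observed at a common set of input locations; the array names and shapes are illustrative.)

    import numpy as np

    # draws: (num_draws, num_inputs), each row one observed realisation of the
    # process at a fixed set of input locations (placeholder data here).
    draws = np.random.randn(200, 100)

    K_emp = np.cov(draws, rowvar=False)          # empirical covariance matrix
    eigvals, eigvecs = np.linalg.eigh(K_emp)     # eigendecomposition

    # Each column of eigvecs is a discretised eigenfunction; features such as
    # non-stationarity show up in how these vary across the input locations.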

Carl wanted to come back to the point that we should use covariance functions with more parameters. He said that he didn't subscribe to the view that you would always end up with connected modes. Carl said that he would explore the space of models by adding e.g. non-Gaussianity, and that you might not know these things up front.

Manfred asked how much David thought non-Gaussian situations played a role, David said it was an impossible question to answer ...

But David said he was an eigenfunction sceptic because if you change p(x) then the eigenfunctions change, but the predictions don't change in practice. So he felt that it was meaningless. David suggested looking at his web page to see a rant about why eigenfunction analyses are not a good thing ... Roger said except for answering Yaakov's question ...

Chris responded by saying that he thought the most interesting question was design of kernels for given problems, but then the other area of interest was large data.
