Robin Zhu
10/17/2002
09:40 AM ET (US)

To Joe: Would you please send me your paper about the kernel trick? My email: zhuhl@cs.ust.hk
Thanks.
Robin

Robin Zhu
10/17/2002
09:16 AM ET (US)

To Anand: I have such a subroutine written in C++. If you need it, please feel free to let me know by email:
zhuhl@cs.ust.hk
Robin

Anand
09/17/2002
02:58 AM ET (US)

In fact we are developing a data-mining app. For that I have developed a back-propagation tool that runs quite well for classification.
But I need to do some input preprocessing with PCA (Principal Components Analysis). For that I have to determine the eigenvalues of an n-by-n correlation matrix.
So here I have to develop an algorithm for it. I'm searching on the Net.
If you can help in some way, please do. Anand
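For what it's worth, the eigendecomposition step described above is a few lines in NumPy; a minimal sketch (the function name and toy data are just for illustration; a C++ version would typically call a symmetric-eigenvalue routine such as Jacobi or one from LAPACK):

```python
import numpy as np

def correlation_eigen(X):
    """Eigenvalues/eigenvectors of the correlation matrix of X (rows = samples)."""
    R = np.corrcoef(X, rowvar=False)    # n-by-n correlation matrix
    vals, vecs = np.linalg.eigh(R)      # eigh: the right routine for symmetric matrices
    order = np.argsort(vals)[::-1]      # descending order, as PCA wants
    return vals[order], vecs[:, order]

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))           # toy data: 200 samples, 5 variables
vals, vecs = correlation_eigen(X)
# eigenvalues of an n-by-n correlation matrix sum to n (here 5)
```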

Junwen WU
09/27/2001
06:28 PM ET (US)

To Joe: I don't know much about denoising. What I said about classical denoising methods refers to using filters, most likely adaptive filters, to reduce the effects of noise. (Sometimes maybe only a mean filter or median filter. For example, I think a median filter can get good results on the USPS data contaminated by grain noise. It is much easier than KPCA. But anyway, as explained by Belongie, the comparison that appears in the paper is just to show the advantage of KPCA over PCA, so maybe what I said before is just irrelevant to the paper :)) I know people have done much work on using the wavelet transform in denoising, such as thresholding the wavelet coefficients to get a denoised image.
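To make the filter-based baseline concrete, here is a naive 3x3 median filter in NumPy; a minimal sketch only (edge pixels are left unchanged, and real code would use an optimized routine such as scipy.ndimage.median_filter):

```python
import numpy as np

def median3x3(img):
    """Naive 3x3 median filter; border pixels are left unchanged."""
    out = img.copy()
    for i in range(1, img.shape[0] - 1):
        for j in range(1, img.shape[1] - 1):
            out[i, j] = np.median(img[i-1:i+2, j-1:j+2])
    return out

# toy example: a flat image with isolated "grain" (impulse) noise
img = np.zeros((8, 8))
img[3, 3] = img[5, 2] = 1.0
clean = median3x3(img)
# isolated spikes are removed: the median of a neighborhood with one outlier is 0
```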

Joe Drish
09/27/2001
01:42 PM ET (US)

To Wu: I think that's an interesting point. What are the other classical denoising methods that they should be comparing KPCA to, instead of linear PCA?
To andrew: Thanks for picking that out. 3000 would allow them to train on more examples, but using 300 would make their data set more balanced, since they only used 50 examples to test. So I'm not really sure.
Joe

Markus Herrgard
09272001
01:03 PM ET (US)

Regarding the issue of choosing kernels: wouldn't the kind of differential-geometry-based analysis done in section IIE give some idea of how well a given kernel should work? As an example, when I was playing around with KPCA on simple datasets (e.g. Fisher's iris data) I noticed that some kernels (e.g. homogeneous polynomial and Gaussian) had an apparent tendency to map multiple data points in the input space to a very small region in the feature space. This could possibly correspond to the kind of singularities in the feature space discussed in section IIE. Avoiding these singularities would be quite useful, especially if the input-space data points are from different classes and one wants to use the KPCA projections as inputs to a classifier. However, it seems unlikely that there exists a simple way to use e.g. the curvature tensor R of a kernel mapping to decide what would be the best kernel for a given case.

Gyozo Gidofavi
09/27/2001
12:32 PM ET (US)

This is a reply to the general question asked by Dave Kauchak about the number "n". As far as I know, as a general rule of thumb, if you want to cover X% of the variance in the data set, "n" can be selected as follows: choose the smallest "n" such that (sum(i=1 to n) eigenvalue(i)) / sum(all eigenvalues) >= X/100. In other words, select the first "n" eigenvector components such that the normalized cumulative sum of the corresponding eigenvalues reaches X/100, where X is a percentage. X is usually 90%, but this can be really problem-specific and in most cases is determined experimentally. Finally, I'm not certain whether a larger number of eigenvector components always gives better results; in fact, after a certain point, to my understanding, it can severely degrade performance. Once again, let me repeat that the method for selecting "n" described above is a general rule of thumb. Edited 09/27/2001 12:53 PM
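The rule of thumb above is easy to state in code; a minimal NumPy sketch (choose_n is a made-up name for illustration):

```python
import numpy as np

def choose_n(eigenvalues, coverage=0.9):
    """Smallest n whose leading eigenvalues cover `coverage` of the total variance."""
    vals = np.sort(np.asarray(eigenvalues, dtype=float))[::-1]  # descending order
    cum = np.cumsum(vals) / vals.sum()                          # normalized cumulative sum
    return int(np.searchsorted(cum, coverage) + 1)

n = choose_n([5.0, 3.0, 1.0, 0.5, 0.5], coverage=0.9)  # (5+3+1)/10 = 0.9, so n = 3
```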

andrew cosand
09/27/2001
06:23 AM ET (US)

One trifling question: the bottom of the first column of page 13 says 300 examples, whereas the middle of the second column says 3000. I assume this is a typo, but does anyone know which is right? Just curious....
As for the choice-of-kernel debate, I'm pretty sure that since the idea is to use a kernel of the form k(x, y) = (\Phi(x) \cdot \Phi(y)), where \Phi maps from the input space to some useful space where things are nicely separable and such, the kernel has to make a difference. If you used a kernel that corresponded to a \Phi that mapped the input space into some lame space where the data wasn't usefully distinguishable, I don't see how it would work. To give a sneak preview of my paper, it involves finding Gaussian-weighted distances between objects (which can be seen as a kernel operation), which works better than a bunch of other measures you could pick. Therefore, I cast my vote for the kernel-has-to-matter side.
I was going to say something about the notation being unfamiliar and getting somewhat lost in the math, but since we're not delving into that in great detail I'll save that comment for later ;)
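The Gaussian-weighted distance mentioned above can indeed be written as a kernel; a minimal NumPy sketch of the Gaussian (RBF) kernel matrix, which acts as an implicit dot product \Phi(x) \cdot \Phi(y) in an infinite-dimensional feature space (toy data for illustration):

```python
import numpy as np

def rbf_kernel(X, Y, sigma=1.0):
    """Gaussian kernel k(x, y) = exp(-||x - y||^2 / (2 sigma^2))."""
    sq_dists = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(-1)  # pairwise squared distances
    return np.exp(-sq_dists / (2.0 * sigma ** 2))

X = np.array([[0.0, 0.0], [1.0, 0.0]])
K = rbf_kernel(X, X)
# the diagonal is 1 (k(x, x) = 1); off-diagonal entries decay with distance
```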

Dave Kauchak
09/27/2001
03:56 AM ET (US)

Ok, I'm not particularly familiar with this material, so if the question sounds funny, that's why :) These PCA methods choose some number of components, n, for use in reconstruction. How is this number generally chosen? In the paper, they present results for varying n, and the choice of n obviously has an impact on the performance. Given the original images, it's fairly easy to see which n performs better, but is this a hard problem in practice? Does the performance always increase as n increases?

Wu Junwen
09/27/2001
02:39 AM ET (US)

1. I think the most practical function of PCA is feature extraction and dimension reduction, not noise reduction, while KPCA has a main advantage in that it can capture the data structure easily, so that noise deviating from the main structure can be detected. So I don't think it is proper to compare the denoising results of linear PCA and KPCA. I am wondering what the result would be if the denoising experiment compared kernel PCA with other classical denoising methods. 2. I think there is still some assumption about the data model: PCA can't deal with uniform data. But practically, in the real world, the data wouldn't be uniform, so that's why we can almost always adopt PCA without any concern. 3. If we only use the kernels that always have a preimage, then the Gaussian kernel would be banned. A great loss! Edited 09/27/2001 02:51 AM

Joe Drish
09/27/2001
02:22 AM ET (US)

*...would the utility of a particular kernel be related to how well it unwraps a particular distribution into a gaussian shape,...*
Given what Anand said, that PCA does not assume any source model on the input space, then I guess the answer is no.
*...and correspondingly does that imply an inescapable tie between kernel choice and problem domain, or might there be a "universal" mapping which tends to map everything into gaussians (i.e., perhaps some of the infinite dimensional mappings have this property?)?*
I think there is a tie between kernel choice and problem domain, but that tie is not necessarily linked to how well the kernel unwraps a particular distribution into a Gaussian shape. I am not aware of any precise heuristics for choosing kernels; maybe someone else is. From what I understand, kernels are usually chosen in practice from experience, i.e. using what has worked well on a particular application before.

Anand
09/27/2001
02:06 AM ET (US)

PCA and Gaussian sources:
In my opinion, PCA does not assume any source model on the input space per se (surprise, surprise). In fact, this has been cited as one of the drawbacks of conventional PCA by Tipping and Bishop in their paper titled "Probabilistic PCA", where they assume a latent-variable source model.
Then why do people keep connecting PCA to Gaussian sources? The reason is that, depending on the application it is employed for, PCA has some additional special properties (such as global optimality w.r.t. a cost criterion) when the source is jointly Gaussian. Let me explain this statement with a compression example.
One way to look at PCA is as a transform (called the Karhunen-Loève transform) which rotates the coordinate axes. In low-complexity compression, one usually rotates the data and then compresses each dimension independently. It is a standard compression result that when the sources are jointly Gaussian, the KL transform (PCA) is the globally optimal way to rotate the data among all transforms.
Similar results exist for other applications (such as minimum mean-square estimation and classification) where PCA emerges as the optimal candidate when the sources are jointly Gaussian.
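A minimal NumPy sketch of this rotation view: rotating correlated data onto its PCA/KL axes makes the sample covariance diagonal, so each dimension can then be handled independently (toy data for illustration, not from the paper):

```python
import numpy as np

rng = np.random.default_rng(1)
# correlated 2-D Gaussian data: white noise pushed through a mixing matrix
A = np.array([[2.0, 0.0], [1.5, 0.5]])
X = rng.normal(size=(5000, 2)) @ A.T
X -= X.mean(0)                        # center the data

C = X.T @ X / len(X)                  # sample covariance (correlated: C[0,1] != 0)
vals, vecs = np.linalg.eigh(C)        # eigenvectors = PCA / KL axes
Y = X @ vecs                          # rotate onto those axes

Cy = Y.T @ Y / len(Y)                 # covariance after rotation: diagonal
```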

HsinHao Yu
09/27/2001
01:42 AM ET (US)

1. Regarding Brandyn's question: I guess it is common knowledge that PCA somehow assumes Gaussians, as Javier pointed out last week. But since I am quite new to this field, can anyone try to explain the relationship between PCA and Gaussian sources? I thought the relationship was this: PCA rotates your data so that the components become independent if the sources are Gaussian. So if you don't care too much about independence, I don't see why we should choose "Gaussianifying" kernels. I mean, no matter what the distribution is (in the feature space), PCA will still give you maximal variance along the first eigenvector, and that's what's so useful about PCA, no? Please forgive me if this is a silly argument.
2. The reproducing-kernel section is confusing to me. I can see that for each kernel a map can be constructed from the input space (X) to the RKHS, but I don't quite follow how the RKHS can be interpreted as the feature space, since elements in an RKHS are functions.
3. This is a completely irresponsible suggestion (meaning that I have no idea how this can be done, and I am not even going to try), but maybe it is worthwhile to ask this question: is it possible to characterize the set of kernels that always have a preimage? If we restrict ourselves to this set of kernels, how much would it reduce the power of kernel PCA?

Brandyn Webb
09/26/2001
10:33 PM ET (US)

Hi Joe et al. A few things I'd love to hear anyone's thoughts on: From the perspective of "modeling" the input space, linear PCA, if I understand it right, essentially assumes the input data is a Gaussian blob centered about the origin, and the eigenvectors/values returned represent the scaling of that blob in various directions, in descending order of magnitude. In that sense, then, is the chief aim of KPCA to project a non-Gaussian input distribution into a Gaussian distribution (typically in a higher-dimensional space, but it could just as well be the same or lower depending on the mapping)? And if so, would the utility of a particular kernel be related to how well it unwraps a particular distribution into a Gaussian shape, and correspondingly does that imply an inescapable tie between kernel choice and problem domain, or might there be a "universal" mapping which tends to map everything into Gaussians (i.e., perhaps some of the infinite-dimensional mappings have this property?)? For a converse question, is it conceivable that certain distributions could be strongly resistant to ever appearing Gaussian? One possible example is what I might call the "curse of degeneracy", which is perhaps what the curse of dimensionality becomes in KPCA: Consider an image of a tic-tac-toe board, but with no particular rules, so each square either gets an X or an O. If viewed locally, each subimage (one square) has two very obvious clusters ("X" and "O"), but the ensemble, rather than having 18 clusters, has 2^9 clusters which are annoyingly space-filling in the sense of occupying all of the corners of the hypercube spanned by the 9 (in the optimal case) feature axes.
Worse, because of the uniformity of the independent distributions (each square choosing X with 50% odds, say), the covariance matrix would be degenerate, and in a manner which a "generic" (i.e., with no tricks to specifically address this problem) nonlinear kernel may not change, since the root of the problem is that any two nonidentical tic-tac-toe boards are going to have orthogonal projections onto the data set, pretty much regardless of the kernel, as long as the kernel is centered and does not bias one square over another. The net result (and please correct me if I'm wrong) is that whatever eigenvectors you get will be pretty much random, depending on the particulars of the dataset, but what you won't get is 9 clear winners representing the 9 independent squares. If there's some way that KPCA does solve this, I'd love to know. It's a toy pathological example, but I think it has analogies in real-world images. (Perhaps that's a faulty assumption?) Thoughts? Edited 09/26/2001 11:03 PM
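The degeneracy described above can be checked numerically; a minimal NumPy simulation (a toy setup assuming each square is an independent ±1 value) showing that the covariance eigenvalues all cluster near 1, so no principal direction wins:

```python
import numpy as np

rng = np.random.default_rng(0)
# random "tic-tac-toe" boards: 9 squares, each independently +1 (X) or -1 (O)
boards = rng.choice([-1.0, 1.0], size=(20000, 9))

C = boards.T @ boards / len(boards)   # sample covariance (mean is ~0 by symmetry)
vals = np.linalg.eigvalsh(C)
# all eigenvalues sit near 1: no direction dominates, so the principal axes
# are essentially arbitrary, which is the degeneracy described above
spread = vals.max() - vals.min()
```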

Joe Drish
09/25/2001
08:33 PM ET (US)

Hi all. This paper talks about the kernel trick. The kernel trick minimizes the computational burden associated with computing dot products in a high-dimensional space: data are implicitly mapped from input space into a higher-dimensional feature space, and the dot product there is evaluated via the kernel. The problem is that once you have mapped the data into a higher-dimensional space and separated it there, how do you return to input space? This paper specifically addresses that problem, using fixed-point iteration. The paper is important because it addresses this fundamental problem, and also because it demonstrates how the method can be used for a particular application: denoising. Denoising experiments are shown at the end. The paper is divided into six sections. Section 1 defines what kernels are and introduces the kernel trick. Section 2 talks about performing the mapping of data from input space to feature space; Mercer theory describes the conditions under which the kernel trick can be applied. Section 3 reviews two algorithms used for linear separation of data in feature space, namely kernel principal components analysis (KPCA) and support vector machines (SVMs). Section 4 is the most important, since it describes how to return from feature space to input space; the method used is the fixed-point iteration mentioned above. Section 5 discusses reduced-set methods; I am not going to cover it in the talk. The final section, Section 6, demonstrates various denoising experiments. The most important points of the paper are the notion of the kernel trick and the method used to return from feature space to input space, so the most important sections are 1 and 4. My talk will be very conceptual and intuitive. I will not delve into many of the mathematical details.
I will just try to provide a high-level overview of the important concepts covered in the paper. Joe Edited 09/25/2001 08:35 PM
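To make the fixed-point idea concrete: for a Gaussian kernel, an approximate preimage z of a feature-space expansion sum_i gamma_i \Phi(x_i) can be found by iterating a kernel-weighted mean of the training points. A minimal NumPy sketch of that scheme (toy data and names for illustration; this is a sketch of the kind of iteration described in the paper, not the authors' code):

```python
import numpy as np

def preimage_fixed_point(X, gamma, z0, sigma=1.0, iters=100):
    """Fixed-point iteration for an approximate preimage under a Gaussian kernel:
        z <- sum_i gamma_i k(z, x_i) x_i / sum_i gamma_i k(z, x_i)
    X: (m, d) training points, gamma: (m,) expansion coefficients, z0: start point."""
    z = z0.astype(float).copy()
    for _ in range(iters):
        w = gamma * np.exp(-((X - z) ** 2).sum(1) / (2.0 * sigma ** 2))
        z = (w[:, None] * X).sum(0) / w.sum()  # kernel-weighted mean of the x_i
    return z

# toy check: with uniform gamma and a tight symmetric cluster, the preimage
# lands at the cluster centre
X = np.array([[1.0, 1.0], [1.1, 0.9], [0.9, 1.1]])
z = preimage_fixed_point(X, np.ones(3) / 3, z0=np.array([0.0, 0.0]))
```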


