
TOPIC:

Input space vs. feature space in kernel-based methods

15
Robin Zhu
10-17-2002
09:40 AM ET (US)
To Joe,
Would you please send me your paper about the kernel trick?
My email: zhuhl@cs.ust.hk

Thanks.

Robin
14
Robin Zhu
10-17-2002
09:16 AM ET (US)
To Anand:
I have such a subroutine written in C++. If you need it,
please feel free to let me know by email:

zhuhl@cs.ust.hk

Robin
13
Anand
09-17-2002
02:58 AM ET (US)
In fact we are developing a data mining app.
For that I have developed a backpropagation tool
that runs quite well for classification purposes.

But I need to do some input preprocessing
by PCA (Principal Components Analysis).
For that I have to determine the eigenvalues of an
n-by-n correlation matrix.

And here I have to develop an algorithm for it.
I'm searching on the Net.

If you can help me in some way,
please do.
Bye,
-Anand
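For reference, a minimal sketch of the kind of eigenvalue routine being asked for, using numpy's symmetric eigensolver on the correlation matrix (an illustrative example only, not the C++ subroutine offered above):

import numpy as np

def pca_from_correlation(X):
    """Eigenvalues and eigenvectors of the correlation matrix of X.

    X is an (m, n) array of m samples and n features. The correlation
    matrix is symmetric, so numpy's eigh (for symmetric matrices) is
    the appropriate solver.
    """
    R = np.corrcoef(X, rowvar=False)        # n-by-n correlation matrix
    eigvals, eigvecs = np.linalg.eigh(R)    # eigenvalues in ascending order
    order = np.argsort(eigvals)[::-1]       # re-sort in descending order
    return eigvals[order], eigvecs[:, order]

# Example: 200 samples of 5 correlated features
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5)) @ rng.normal(size=(5, 5))
vals, vecs = pca_from_correlation(X)
print(vals)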
12
Junwen WU
09-27-2001
06:28 PM ET (US)
To Joe:
I don't know much about denoising. What I said about classical denoising methods refers to using filters, most likely adaptive filters, to reduce the effects of noise. (Sometimes maybe only a mean filter or median filter. For example, I think a median filter can get good results on the USPS data, which are contaminated by grain noise. It is much easier than KPCA. But anyway, as explained by Belongie, the comparison that appears in the paper is just to show the advantage of KPCA over PCA, so maybe what I said before is just irrelevant to the paper :)) I know people have done much work on using wavelet transforms for denoising, such as thresholding the wavelet coefficients to get a denoised image.
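A minimal sketch of the kind of median-filter baseline mentioned above, assuming scipy is available and the digits are stored as 2-D grayscale arrays (the toy image and noise model here are made up for illustration):

import numpy as np
from scipy.ndimage import median_filter

def denoise_digit(img, size=3):
    """Apply a simple median filter to a 2-D digit image.

    A 3x3 median filter is a classical baseline for impulse ('grain')
    noise, and is far cheaper than KPCA-based denoising.
    """
    return median_filter(img, size=size)

# Toy example: a 16x16 image with salt-and-pepper noise
rng = np.random.default_rng(0)
clean = np.zeros((16, 16))
clean[4:12, 7:9] = 1.0                   # a crude "1"
noisy = clean.copy()
mask = rng.random(clean.shape) < 0.1     # corrupt 10% of the pixels
noisy[mask] = rng.integers(0, 2, size=mask.sum())
denoised = denoise_digit(noisy)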
11
Joe Drish
09-27-2001
01:42 PM ET (US)
To Wu: I think that's an interesting point. What are the other classical de-noising methods that they should be comparing KPCA to, instead of linear PCA?

To andrew: Thanks for picking that out. 3000 would allow them to train on more examples, but using 300 would make their data set more balanced, since they only used 50 examples to test. So I'm not really sure.

Joe
10
Markus Herrgard
09-27-2001
01:03 PM ET (US)
Regarding the issue of choosing kernels, wouldn't the kind of differential-geometry-based analysis done in section II-E give some idea of how well a given kernel should work? As an example, when I was playing around with KPCA on simple datasets (e.g. Fisher's iris data) I noticed that some kernels (e.g. homogeneous polynomial and Gaussian) had an apparent tendency to map multiple data points in the input space to a very small region in the feature space. This could possibly correspond to the kind of singularities in the feature space discussed in section II-E. Avoiding this kind of singularity would be quite useful, especially if the input-space data points are from different classes and one wants to use the KPCA projections as inputs to a classifier. However, it seems unlikely that there exists a simple way to use e.g. the curvature tensor R of a kernel mapping to decide what would be the best kernel for a given case.
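This kind of experiment is easy to reproduce; a rough sketch using scikit-learn's KernelPCA on the iris data (the kernel parameters below are arbitrary illustrative choices, not the ones used in my original experiments):

import numpy as np
from sklearn.datasets import load_iris
from sklearn.decomposition import KernelPCA
from sklearn.preprocessing import StandardScaler

X, _ = load_iris(return_X_y=True)
X = StandardScaler().fit_transform(X)

for kernel, params in [("poly", {"degree": 3, "coef0": 0.0}),  # homogeneous polynomial
                       ("rbf", {"gamma": 5.0})]:               # Gaussian
    Z = KernelPCA(n_components=2, kernel=kernel, **params).fit_transform(X)
    # Crude "collapsing" diagnostic: nearest-neighbour distances relative
    # to the overall spread of the 2-D projection.
    D = np.linalg.norm(Z[:, None, :] - Z[None, :, :], axis=-1)
    np.fill_diagonal(D, np.inf)
    print(kernel, "spread:", Z.std(axis=0),
          "median NN dist:", np.median(D.min(axis=1)))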
9
Gyozo Gidofavi
09-27-2001
12:32 PM ET (US)
This is a reply to the general question asked by Dave Kauchak about the number "n". As far as I know, as a general rule of thumb, if you want to cover X% of the variance in the data set, you can select "n" as follows: choose the smallest "n" such that
sum(i=1 to n) eigenvalue(i) / sum(all eigenvalues) >= X/100.
In other words, select the first "n" eigenvector components such that the normalized cumulative sum of the corresponding eigenvalues reaches X/100, where X is a percentage. X is usually around 90%, but this can be very problem-specific and in most cases is determined experimentally. Finally, I'm not certain whether a larger number of eigenvector components always gives better results; in fact, to my understanding, past a certain point it can severely degrade performance.
Once again, let me repeat that the method for selecting "n" described above is a general rule of thumb.
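A minimal code sketch of this rule of thumb, assuming the eigenvalues are already sorted in descending order:

import numpy as np

def components_for_variance(eigenvalues, X=90.0):
    """Smallest n such that the first n eigenvalues cover X% of the variance.

    `eigenvalues` are assumed sorted in descending order; X is a percentage.
    """
    lam = np.asarray(eigenvalues, dtype=float)
    cumulative = np.cumsum(lam) / lam.sum()
    return int(np.searchsorted(cumulative, X / 100.0) + 1)

# Example: a quickly decaying spectrum
lam = np.array([5.0, 2.0, 1.0, 0.5, 0.3, 0.2])
print(components_for_variance(lam, X=90))   # -> 4, since (5+2+1+0.5)/9 ~ 94%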
Edited 09-27-2001 12:53 PM
8
andrew cosand
09-27-2001
06:23 AM ET (US)
One trifling question: the bottom of the first column of page 13 says 300 examples, whereas the middle of the second column says 3000. I assume this is a typo, but does anyone know which is right? Just curious....

As for the choice-of-kernel debate, I'm pretty sure that since the idea is to use a kernel of the form
k(x, y) = \Phi(x) \cdot \Phi(y)
where \Phi maps from the input space to some useful space where things are nicely separable and such, the kernel has to make a difference. If you used a kernel that related to a \Phi that mapped the input space into some lame space where the data wasn't usefully distinguishable, I don't see how it would work. To give a sneak preview of my paper, it involves finding Gaussian-weighted distances between objects (which can be seen as a kernel operation), which works better than a bunch of other measures you could pick. Therefore, I cast my vote for the kernel-has-to-matter side.
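For concreteness, a small sanity check that a kernel of this form really is a dot product in an implicit feature space; here the homogeneous degree-2 polynomial kernel in 2-D, whose feature map can be written out explicitly:

import numpy as np

def phi(v):
    """Explicit feature map for the homogeneous degree-2 polynomial kernel
    in 2-D, so that (v . w)^2 = phi(v) . phi(w)."""
    x1, x2 = v
    return np.array([x1 * x1, x2 * x2, np.sqrt(2.0) * x1 * x2])

def k(v, w):
    """Degree-2 homogeneous polynomial kernel."""
    return float(np.dot(v, w)) ** 2

x = np.array([1.0, 2.0])
y = np.array([3.0, -1.0])
print(k(x, y), np.dot(phi(x), phi(y)))   # both equal (1*3 + 2*(-1))^2 = 1.0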

I was going to say something about the notation being unfamiliar and getting somewhat lost in the math, but since we're not delving into that in great detail I'll save that comment for later ;-)
7
Dave Kauchak
09-27-2001
03:56 AM ET (US)
Ok, I'm not particularly familiar with this material, so if the question sounds funny, that's why :) These PCA methods choose some number of components, n, for use in reconstruction. How is this number generally chosen? In the paper, they present the results for varying n, and the choice of n obviously has an impact on the performance. Given the original images, it's fairly easy to see which values of n perform better, but is this a hard problem in practice? Does the performance always increase as n increases?
6
Wu Junwen
09-27-2001
02:39 AM ET (US)
1. I think the most practical function of PCA is to do feature extraction and dimension reduction, not noise reduction, while KPCA has the main advantage that it can capture the data structure easily, so that noise that deviates from the main structure can be perceived. So I don't think it is proper to compare the denoising results of linear PCA and KPCA. I am wondering what the result would be if the denoising experiment compared kernel PCA with other classical denoising methods.

2. I think there is still some assumption about the data model: PCA can't deal with uniform data. But practically, in the real world, the data wouldn't be uniform. So that's why we can always adopt PCA without any concern.

3. If we only use the kernels that always have a pre-image, then the Gaussian kernel would be banned. A great loss!
Edited 09-27-2001 02:51 AM
5
Joe Drish
09-27-2001
02:22 AM ET (US)
*...would the utility of a particular kernel be related to how well it unwraps a particular distribution into a gaussian shape,...*

Given what Anand said, that PCA does not assume any source model on the input space, I guess the answer is no.

*...and correspondingly does that imply an inescapable tie between kernel choice and problem domain, or might there be a "universal" mapping which tends to map everything into gaussians (i.e., perhaps some of the infinite dimensional mappings have this property?)?*

I think there is a tie between kernel choice and problem domain, but that tie is not necessarily linked to how well it unwraps a particular distribution into a gaussian shape. I am not aware of any precise heuristics for choosing kernels; maybe someone else is. Experience, or using what has worked well on a particular application before, is usually how kernels are chosen in practice, from what I understand.
4
Anand
09-27-2001
02:06 AM ET (US)
PCA and Gaussian sources:

In my opinion, PCA does not assume any source model on the input space per se (surprise, surprise). In fact, this has been cited as one of the drawbacks of conventional PCA by Tipping and Bishop in their paper titled Probabilistic PCA, where they assume a latent variable source model.

Then why do people keep connecting PCA to Gaussian sources? The reason is that, depending on the application it is employed for, PCA has some additional special properties (such as global optimality w.r.t. a cost criterion) when the source is jointly Gaussian. Let me explain this statement with a compression example.

One way to look at PCA is as a transform (called the Karhunen-Loève transform) which rotates the coordinate axes. In low-complexity compression, one usually rotates the data and then compresses each dimension independently. It is a standard compression result that, when the sources are jointly Gaussian, the KL transform (PCA) is the globally optimal rotation among all transforms.

Similar results exist for other applications (such as minimum mean square estimation, classification) where PCA emerges as the optimal candidate when the sources are jointly Gaussian.
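A minimal numerical sketch of this "rotation" view: for a jointly Gaussian source, rotating onto the eigenvectors of the covariance (the KL/PCA transform) decorrelates the coordinates, after which each dimension can be treated independently (the covariance below is an arbitrary example):

import numpy as np

rng = np.random.default_rng(0)

# Jointly Gaussian 2-D source with correlated coordinates
cov = np.array([[3.0, 1.2],
                [1.2, 1.0]])
X = rng.multivariate_normal(mean=[0.0, 0.0], cov=cov, size=10000)

# KL / PCA transform: rotate onto the eigenvectors of the sample covariance
C = np.cov(X, rowvar=False)
eigvals, eigvecs = np.linalg.eigh(C)
Y = X @ eigvecs                      # rotated data

# Off-diagonal entries of the covariance of Y are ~0: decorrelated coordinates
print(np.round(np.cov(Y, rowvar=False), 3))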
3
Hsin-Hao Yu
09-27-2001
01:42 AM ET (US)
1. Regarding Brandyn's question: I guess it is common knowledge that PCA somehow assumes Gaussians, as Javier pointed out last week. But since I am quite new to this field, can anyone try to explain the relationship between PCA and Gaussian sources? I thought the relationship was this: PCA rotates your data so that they become independent if the sources are Gaussian. So if you don't care too much about independence, I don't see why we should choose "Gaussianifying" kernels. I mean, no matter what the distribution is (in the feature space), PCA will still give you maximal variance along the first eigenvector, and that's what's so useful about PCA, no? Please forgive me if this is a silly argument.

2. The reproducing kernel section is confusing to me. I can see that for each kernel a map can be constructed from the input space (X) to the RKHS, but I don't quite follow how the RKHS can be interpreted as the feature space, since elements of the RKHS are functions.

3. This is a completely irresponsible suggestion (meaning that I have no idea how this can be done, and I am not even going to try), but maybe it is worthwhile to ask this question: is it possible to characterize the set of kernels that always have a pre-image? If we restrict ourselves to this set of kernels, how much would it reduce the power of kernel PCA?
2
Brandyn Webb
09-26-2001
10:33 PM ET (US)
Hi Joe et al,

A few things I'd love to hear anyone's thoughts on:

From the perspective of "modeling" the input space, linear PCA, if I understand it right, essentially assumes the input data is a Gaussian blob centered about the origin, and the eigenvectors/values returned represent the scaling of that blob in various directions, in descending order of magnitude. In that sense, then, is the chief aim of KPCA to project a non-gaussian input distribution into a gaussian distribution (typically in a higher dimensional space, but it could just as well be the same or lower depending on the mapping)? And if so, would the utility of a particular kernel be related to how well it unwraps a particular distribution into a gaussian shape, and correspondingly does that imply an inescapable tie between kernel choice and problem domain, or might there be a "universal" mapping which tends to map everything into gaussians (i.e., perhaps some of the infinite dimensional mappings have this property?)?

For a converse question, is it conceivable that certain distributions could be strongly resistant to ever appearing gaussian? One possible example is what I might call the "curse of degeneracy", which is perhaps what the curse of dimensionality becomes in KPCA: Consider an image of a tic-tac-toe board, but with no particular rules, so each square either gets an X or an O. If viewed locally, each sub-image (one square) has two very obvious clusters ("X" and "O"), but the ensemble, rather than having 18 clusters, has 2^9 clusters which are annoyingly space-filling in the sense of occupying all of the corners of the hypercube spanned by the 9 (in the optimal case) feature axes. Worse, because of the uniformity of the independent distributions (each square choosing X with 50% odds, say), the covariance matrix would be degenerate, and in a manner which a "generic" (i.e., with no tricks to specifically address this problem) non-linear kernel may not change, since the root of the problem is that any two non-identical tic-tac-toe boards are going to have orthogonal projections onto the data set, pretty much regardless of the kernel as long as the kernel is centered and does not bias one square over another. The net result (and please correct me if I'm wrong) is that whatever eigenvectors you get will be pretty much random depending on the particulars of the dataset, but what you won't get is 9 clear winners representing the 9 independent squares. If there's some way that KPCA does solve this, I'd love to know. It's a toy pathological example, but I think it has analogies in real-world images. (Perhaps that's a faulty assumption?)
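One way to see the "no nine clear winners" point numerically, treating each board as just a 9-vector of +/-1 squares (which ignores the pixel-level rendering of the X's and O's): with independent 50/50 squares the covariance is essentially a multiple of the identity, so the eigenvalue spectrum is nearly flat and the recovered eigenvectors are determined mostly by sampling noise. A rough sketch:

import numpy as np

rng = np.random.default_rng(0)

# 5000 random "boards": 9 squares, each independently X (+1) or O (-1)
boards = rng.choice([-1.0, 1.0], size=(5000, 9))

C = np.cov(boards, rowvar=False)
eigvals = np.linalg.eigvalsh(C)[::-1]

# All eigenvalues are close to 1: no dominant directions, so the
# eigenvectors returned are essentially arbitrary rotations of the axes.
print(np.round(eigvals, 3))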

Thoughts?
Edited 09-26-2001 11:03 PM
1
Joe Drish
09-25-2001
08:33 PM ET (US)
Hi all.

This paper is about the kernel trick. The kernel trick is used to avoid the computational burden of explicitly computing dot products in a high-dimensional feature space: the data are implicitly mapped from input space into a higher-dimensional feature space, and the dot product there is evaluated directly via the kernel function applied to input-space points. The problem is that once you have mapped the data into a higher-dimensional space and separated it there, how do you return to input space? This paper specifically addresses that problem, and does so using fixed-point iteration. The paper is important because it addresses this fundamental problem, and also because it demonstrates how the technique can be used for a particular application: de-noising. De-noising experiments are shown at the end.

The paper is divided into six sections. Section 1 defines what kernels are and introduces the kernel trick. Section 2 discusses performing the mapping of data from input space to feature space; Mercer theory provides the mathematical conditions under which the kernel trick can be applied. Section 3 reviews two algorithms that are used for linear separation of data in feature space, namely kernel principal components analysis (KPCA) and support vector machines (SVMs). Section 4 is the most important, since it describes how to return from feature space to input space; the method used to do that is the fixed-point iteration mentioned above. Section 5 discusses reduced-set methods, which I am not going to cover in the talk. The final section, Section 6, presents various de-noising experiments.
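For the Gaussian kernel, the fixed-point iteration for the approximate pre-image takes (if I recall the paper's formula correctly) the form z_{t+1} = sum_i gamma_i k(z_t, x_i) x_i / sum_i gamma_i k(z_t, x_i). A minimal sketch, with the expansion coefficients gamma_i assumed to come from the KPCA projection step:

import numpy as np

def gaussian_pre_image(gamma, X, c, z0, n_iter=100):
    """Approximate pre-image of a feature-space point sum_i gamma_i Phi(x_i)
    under the Gaussian kernel k(x, y) = exp(-||x - y||^2 / c).

    Iterates z <- sum_i gamma_i k(z, x_i) x_i / sum_i gamma_i k(z, x_i).
    A bare-bones sketch: no safeguard against a vanishing denominator
    and no restarts from different starting points.
    """
    z = np.asarray(z0, dtype=float).copy()
    for _ in range(n_iter):
        w = gamma * np.exp(-np.sum((X - z) ** 2, axis=1) / c)
        z = w @ X / w.sum()
    return z

# Toy usage: pre-image of the (unweighted) mean of three points
X = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]])
gamma = np.ones(3) / 3.0
print(gaussian_pre_image(gamma, X, c=1.0, z0=X.mean(axis=0)))

In the actual de-noising experiments the gamma_i come from projecting the noisy point onto the leading kernel principal components; the sketch above only shows the iteration itself.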

The most important point of the paper is the notion of the kernel trick, and also the method they use to return from feature space to input space. So the most important sections are 1 and 4.

My talk will be very conceptual and intuitive. I will not delve into many of the mathematical details. I will just try to provide a high-level overview of the important concepts that are covered in the paper.

Joe
Edited 09-25-2001 08:35 PM
