Serge Belongie
09272001
02:45 AM ET (US)

Regarding Andrew's question, Figure 2 is hard to interpret because only the contours are shown and not the contour values. In the final version of the article (currently linked from the course homepage), they clarify in the text that in all three nonlinear cases, the first principal component varies monotonically along the parabola. The way to think of it then is to consider what values each point in the data set picks off in the different principal components. These values the points take on are the "coordinates" in the feature space. In the linear case, these coordinates merely replicate the original 2D coordinates exactly. In the quadratic case, for example, the first two coordinates in that feature space are very similar for points that are nearby one another along the curve: since there are no contour lines parallel to the underlying parabola, there is no axis along which the noise points can distinguish themselves. This doesn't happen until the 3rd component. So denoising happens when the 3rd component is discarded in the sense that the first 2 coordinates are basically ignorant of variation perpendicular to the curve but vary smoothly along the curve. This is clearer in the toy example with 3 clusters.

andrew cosand
09262001
01:38 AM ET (US)

I guess I'm still a bit confused by the toy example results in figure 2. Probably because I'm trying to visualize something higher dimensional in a lowD space, but I still have trouble seeing how a linear combination of the three components of degree 2 could uniquely specify a point. It seems like the middle one, with the concentric circles, would constrain a radius from the center of the circles, the top would constrain a point in a crossshaped shell, and the bottom one would constrain it in another crossshaped shell, but rotated about 45 degrees. Perhaps I'm off base and these could uniquely constrain a point?
The other thing i found curious is that the one which supposedly picks up the variance in the noise seems to determine how far the points lie from the parabola in a direction normal to the parabola, while the original noise was vertical and had no actual horizontal component.

Brandyn Webb
09252001
05:13 AM ET (US)

I'm curious about the methods and properties of online/incremental versions of KPCA. When I read the paper a couple days ago I thought I saw mention of it and a reference, but now I can't find it  any pointers appreciated. [I stuck my own thoughts on the matter here if anyone is interested (they seemed a wee long and digressing to paste here).]

HsinHao Yu
09242001
11:18 PM ET (US)

I have two comments about the example in page 9.
1. A very simple question: I hope we can spend sometime in class to learn to inteprete the result of Kernel PCA, because I find it harder to understand than linear PCA. The caption of Figure 2 says "Nonlinear PCA uses the third component to pick up the variance caused by the noise, as can be seen in the case of degree 2." Hmm, I don't think I can see that from the figure. Does it mean that pertubations caused by the noise tend to have similar projections on the third eigenvector? If that's the case then this noise is not wellcaptured by the d=3 and d=4 case.
2. I found it very interesting that the contour lines are simply hyperbolics and concentric circles. I understand that the contour lines depend on a.) the distribution of the data, which is artificial in this example, and b.) the kernel. But somehow my intuition is that these kind of contours are quite common in the polynomia kernel. Is there any heuristic that would predict the, say, concentric contours in the second component in the d=2 case?
An additional suggestion is: has anyone tried to apply kernel PCA to natural images, and see what the components look like (a la Bell and Sejnowski in their ICA paper)? We can try polynomial kernel with different order (d=2,3,4,5...) and see which develops simplecelllike receptive fields. This way we can infer that maybe only statistical relationship of a certain order is sufficient to account for the emergence of edge detectors.
Also, singleunit recording showed that neurons in visual area V4 are activated by concentric circles and hyperbolas. I did a project to show that these higer order features can indeed help object recognition. Maybe this can be explained in terms of kernel PCA (eg. detecting statistical relationship of a certain order)?

Markus Herrgard
09242001
10:13 PM ET (US)

I have experimented on using KPCA as a preprocessing method for some classification problems with both low and high dimensional input spaces. In general the choice of kernel and especially the number of eigenvectors (more is not always better) used for classification has a pretty significant effect on classifier performance. The good news is that using KPCA with a linear classifier allows one to look at the projections of the data on the KPCA eigenvectors directly to understand why the classifier does well or badly on a particular dataset (i.e. look for linear separability in the feature space). The bad news is that in most cases using KPCA doesn't improve classifier performance over just using normal PCA. In any case the method is so easy to implement and fast (at least for small data sets) that it is worth experimenting with especially since even a simple LDA classifier can have very good performance when KPCA is used as a preprocessing step.

Anand
09242001
07:51 PM ET (US)

Some more queries :
1. What is the relationship between the eigenvectors in the input space to the eigenvectors in the feature space ?
2. Do we always do better for any choice of Kernel function ? How can we interpret the additional eigenvalues if there is a dimension expansion ?
3. Suggestion :
Instead of using Kernel PCA for dimension expansion and for unravelling additional structure (features) in the data, can we use it for dimensionality reduction. i.e., if it is known that the input space is rank deficient, can we come up with a kernel which will map the input to a lower dimensional subspace. One obvious approach is to project the input on the subspace of eigenvectors with non zero eigen values, but this would be a linear operation. Can we come up with nonlinear dimensionality reduction schemes using Kernel PCA. This Kernel in conjunction with traditional clustering schemes like kmeans or EM can solve many of the instablity problems.

Junwen WU
09242001
02:43 PM ET (US)

I have a question: How to choose the kernel in thekernel PCA method? As far as I know, in current applications of support vector machine, one of the key problems is to choose a proper kernel. It depends on the researcher's experience in a large extent. I have done some work in the application of SVM, and in some cases, the different kernel may cause different results.(Although according to Vapnik's theory, the kernel shouldn't have much effect on the result.) So I wonder if the kernel chosen in PCA would have effect on the result and if there are some methods that can help us to choose a good kernel function. Thanks.


