Piotr Dollar
10012002
02:16 PM ET (US)

Paper looks interesting. At this point I need more background to understand things (before this paper I didn't even know PCA), so no informative comment from me on this one.

sandwichmaker
10012002
01:53 PM ET (US)

Hey Eric,
The statement has been heard before ;) : "The paper definitely raises more questions and possibilities than it tries to answer."
That said, the algorithm does not assume anything about the symmetric nature of the noise; that is a choice made by the person applying the algorithm, based on the nature of the data he is trying to analyse.
As for uses, I have used Lee and Seung's approach to decompose an image into its dominant color channels. This requires that both the A and V matrices be positive, since the color model is additive. As you would guess, the algorithm is susceptible to local minima, but it does work most of the time.
have a look at
http://rick.ucsd.edu/~sagarwal/results.html
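For anyone who wants to see the mechanics, here is a minimal sketch of Lee & Seung's multiplicative updates on a made-up toy matrix (my own reimplementation, not the code behind those results; the data and sizes are invented). The updates keep both factors nonnegative by construction, which is exactly what an additive color model needs:

```python
import numpy as np

rng = np.random.default_rng(0)

def nmf(X, k, n_iter=300, tiny=1e-9):
    """Lee & Seung multiplicative updates for X ~= W @ H with W, H >= 0."""
    n, m = X.shape
    W = rng.random((n, k)) + tiny
    H = rng.random((k, m)) + tiny
    for _ in range(n_iter):
        H *= (W.T @ X) / (W.T @ W @ H + tiny)  # update mixing coefficients
        W *= (X @ H.T) / (W @ H @ H.T + tiny)  # update nonnegative basis
    return W, H

# toy "image": 3 pixels x 4 frames, built from 2 nonnegative channels
X = np.array([[1.0, 2.0, 0.5, 1.5],
              [0.5, 1.0, 0.25, 0.75],
              [2.0, 1.0, 1.0, 1.0]])
W, H = nmf(X, k=2)
print(np.linalg.norm(X - W @ H))  # reconstruction error, should be small
```

Because the updates only ever multiply by nonnegative ratios, positivity is preserved at every step; the flip side is exactly the local-minima sensitivity mentioned above, so random restarts help.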

Josh Wills
10012002
01:41 PM ET (US)

I like the main idea of the paper, and I think it addresses an example of a problem faced in many areas of computer vision and learning. It seems that when a technique is found to work well (here PCA), many will apply it to every problem imaginable without concern for the actual properties of the data or the underlying assumptions of the algorithm in question. I like that they provide a reformulation of the algorithm that is tuned to specific types of data.
This reminds me a bit of Koenderink's ECCV paper that criticized the use of Euclidean distances in the evaluation of pixels in an image. However, I think he may have better luck coining a new term than this paper will, but both were able to confuse me a bit with their math.

Kristin Branson
10012002
01:36 PM ET (US)

In answer to Satya's question 2:
The v's correspond to the eigenvectors in the standard formulation of PCA, and the a's are the coefficients these eigenvectors are weighted by: theta = sum_i a_i * v_i. In the general formulation, the v_i's are the principal components. I do not think the a_i's and the v_i's are independent of each other; otherwise the algorithm would not have to iteratively update the a's and then the v's. The hope with the iterative algorithm is that it will eventually converge to the optimal values of a and v.
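To make the coupling concrete, here is a toy alternating-minimization sketch for the plain squared-loss (Gaussian) case; this is my own illustration of the alternation idea, not the paper's general algorithm. Each solve depends on the other factor's current value, which is why the updates must alternate:

```python
import numpy as np

rng = np.random.default_rng(1)

def alternating_fit(X, l, n_iter=200):
    """Alternate exact least-squares solves for A (coefficients) and V
    (components) so that X ~= A @ V. Neither solve is independent of the
    other factor's current value."""
    n, d = X.shape
    A = rng.standard_normal((n, l))
    V = rng.standard_normal((l, d))
    for _ in range(n_iter):
        A = X @ np.linalg.pinv(V)   # best a's given the current v's
        V = np.linalg.pinv(A) @ X   # best v's given the current a's
    return A, V

X = rng.standard_normal((20, 5))
A, V = alternating_fit(X, l=2)
# for squared loss this converges to the optimal rank-2 fit (Eckart-Young)
s = np.linalg.svd(X, compute_uv=False)
print(np.linalg.norm(X - A @ V), np.sqrt((s[2:] ** 2).sum()))
```

For squared loss the alternation provably reaches the best low-rank fit from a generic start; for general Bregman losses only convergence to a local optimum is on offer, which matches the hope expressed above.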

Eric Wiewiora
10012002
01:24 PM ET (US)

Hey Sameer.
Very interesting paper. Some day I hope I can play with math at that level. I have some questions about the intuition behind how the algorithm works.
There are several differences between Gaussian noise and noise determined by another distribution, such as the exponential or Poisson. One of the major differences is that Gaussian noise is symmetric about the mean. Does this algorithm assume that the best fit for the data is not necessarily symmetrically distant from its principal component projection?
Does this explain why the exponential distribution is less susceptible to the outliers in the first example?
The paper definitely raises more questions and possibilities than it tries to answer.
Any thoughts on how you would use non-Gaussian PCA?
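A quick numerical sanity check of the symmetry point (my own toy demo, nothing from the paper): both samples below have mean 1, but only the Gaussian one is symmetric about it.

```python
import numpy as np

rng = np.random.default_rng(3)

gauss = rng.normal(loc=1.0, scale=1.0, size=100_000)
expo = rng.exponential(scale=1.0, size=100_000)  # also mean 1, but skewed

# Gaussian noise is symmetric about its mean (mean ~= median);
# exponential noise is not: its median is ln 2 ~= 0.69, below the mean of 1
print(gauss.mean() - np.median(gauss))
print(expo.mean() - np.median(expo))
```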

sandwichmaker
10012002
01:21 PM ET (US)

This is in reply to Andrew's question.
The second equation just states the functional form of the distribution; by virtue of that functional form, the term G(\theta) turns out to be a normalizer.
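To spell that out (my transcription of the exponential-family form that equation (2) takes; double-check signs and notation against the paper):

```latex
% Exponential family in canonical form:
P(x \mid \theta) = P_0(x)\, e^{x\theta - G(\theta)},
\qquad
G(\theta) = \log \int P_0(x)\, e^{x\theta}\, dx .
```

Subtracting G(\theta) inside the exponent is exactly what makes the density integrate to 1, and differentiating G gives g(\theta) = G'(\theta) = E[x|\theta].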

sandwichmaker
10012002
01:19 PM ET (US)

Well, convergence to the global optimum remains an open problem, but that is more often than not the case with nonlinear optimization problems.
The \epsilon term is like a leash attached to the solution point. It adds a small multiple of the divergence from a fixed point to the objective. The small multiple ensures that the objective function is not modified by much in a region around that point, but as the solution starts to diverge toward infinity, the term gets large and penalizes the quality of the solution.
The a's are not eigenvalues. They are the mixing coefficients, or "Bregman projections," of the data onto the basis vectors in V. And we are not doing the minimizations independently; it is an alternating procedure, in which we alternate between estimating each of them.
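Here is a 1-D toy picture of the leash (my own illustration: a logistic loss on separable data stands in for the paper's loss, and \epsilon*theta^2 stands in for \epsilon*B_F). Without the term, the minimizer runs off to infinity; with it, the solution stays bounded:

```python
import numpy as np

# Separable 1-D data: the unregularized logistic loss keeps improving as
# theta -> infinity, so gradient descent drifts off without bound
x = np.array([1.0, 2.0, -1.0, -3.0])
y = np.array([1.0, 1.0, -1.0, -1.0])

def descend(eps, steps=5000, lr=0.1):
    """Gradient descent on sum_i log(1 + exp(-y_i x_i theta)) + eps * theta**2."""
    theta = 0.0
    for _ in range(steps):
        grad = (-y * x / (1.0 + np.exp(y * x * theta))).sum() + 2 * eps * theta
        theta -= lr * grad
    return theta

print(descend(eps=0.0))   # keeps growing (log-slowly) toward infinity
print(descend(eps=0.01))  # the epsilon "leash" pins theta to a finite value
```

Near the fixed point the quadratic term is tiny, so the solution is barely perturbed; far away it dominates, which is the bounding mechanism in one sentence.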

Andrew Rabinovich
10012002
01:13 PM ET (US)

Excellent paper. I have a question regarding the second equation. After the equation is presented, the author describes the meaning of G(theta). I understand that the integral of the conditional probability P(x|theta) should always be 1, but I don't understand why G(theta) should be subtracted in this equation. Andrew

Satya
10012002
01:00 PM ET (US)

I think it is a wonderful paper. However, at the end of the paper it is a bit discouraging to find that the algorithm is not guaranteed to converge to an optimal solution. I have the following questions/doubts:
1) Near the end of section 4 we see that an extra term \epsilon*B_F() has been added to the loss function to force the solution to a local minimum (or saddle point). I understand that the proof might be rigorous, but can we have an intuitive insight into why adding such a term would keep theta(t) bounded? Is such a thing also used in other optimization problems where there is a possibility of the algorithm diverging to a degenerate solution at infinity?
2) We have two sets of parameters, a and v. To me it seems that the a's represent the eigenvalues and the v's represent the eigenvectors of the subspace onto which we are projecting our original data (am I right?). Now the assumption is that the two sets are independent of each other (otherwise we couldn't do the minimizations separately). How far is that assumption valid? (Please ignore the question if I am not making sense; I can explain it in class.)

Sameer Agarwal
10012002
04:52 AM ET (US)

Okay, I will try to answer some here and the rest in the talk (after all, I want an audience ;)). I should be able to explain the GLM stuff a little more clearly in class, but here is the short answer. When we perform normal linear regression, we assume the model is of the form y = Ax + \epsilon, where \epsilon (the noise) is assumed to be zero-mean Gaussian with a fixed variance. But as the paper states, there are cases where you would want a different noise model; suppose you know the noise will always be positive, then you might want to use the Poisson distribution. This extension of regression to various kinds of noise models is the subject of Generalized Linear Models. The following link should help: http://www.isds.duke.edu/computing/S/Snotes/node81.html

The Bregman divergence is actually a meta object, parameterized by the function F. The divergence itself is a generalization of the concept of entropy, where the exact form of the generalization depends on the choice of F. For the exponential family, choosing F = log x and calculating the Bregman divergence over the parameters gives the KL divergence between the two distributions.

As for the actual algorithm assuming orthogonality of the v_i, that is not true. The algorithm by itself does not assume anything like that. In the case of standard PCA, one has to explicitly state that the vectors v_i turn out to be orthonormal. As for what stops us from getting the same v_i each time: if you look carefully at the algorithm, there are multiple random restarts for each v_c, and the fact that a representation in which all the v_i's are the same would be quite bad in terms of reconstruction error/Bregman divergence pushes the vectors away from each other.

Edited 10012002 05:49 AM
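To make the GLM point concrete, here is a toy Poisson regression on synthetic data (my own illustration with made-up data; the paper generalizes this same idea from regression to PCA). Instead of minimizing squared error, we do gradient ascent on the Poisson log-likelihood with the canonical log link:

```python
import numpy as np

rng = np.random.default_rng(2)

# synthetic counts: y_i ~ Poisson(exp(x_i . beta)) -- positive, asymmetric noise
n, d = 300, 3
X = rng.standard_normal((n, d))
beta_true = np.array([0.5, -0.3, 0.2])
y = rng.poisson(np.exp(X @ beta_true))

# gradient ascent on the Poisson log-likelihood; note it differs from
# least squares only through the link mu = exp(X @ beta)
beta = np.zeros(d)
for _ in range(3000):
    mu = np.exp(X @ beta)
    beta += 0.001 * X.T @ (y - mu)
print(beta)  # should land near beta_true
```

The gradient X.T @ (y - mu) has the same shape as in ordinary least squares; only the way mu is computed from the linear predictor changes, which is the whole GLM trick in miniature.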

Kristin Branson
10012002
02:49 AM ET (US)

I think Sameer has quite a task to present all the background information necessary to understand this paper. I think the basic idea of the paper is clear, simple, and very nice, but I am getting quite confused by the math. It would help, I think, to know the meanings of all the parts of equation (2). I can see what the functions G(theta) and P0(x) represent in specific examples, but I don't know what they mean in general. I feel like they have more meaning than just being a way to split log(P(x|theta)) into parts by their dependencies, but maybe I am wrong. This might help me understand why it can be "easily verified" that g(theta) = E[x|theta], as I do not understand the relationship between P0(x) and the expected value of x.
I am totally confused by the whole last paragraph of 2.2 (as I have not looked at the Generalized Linear Models paper), and I think it would be helpful to hear something about that.
I was also wondering what the meaning of the Bregman distance is. It seems to fit conveniently into the minimization problem, but I am interested to hear its standard context.
Finally, does the actual algorithm for minimizing the loss function assume that the components v_i are mutually orthogonal? If it does not, then how is each minimization done separately? It seems like we want the collection of l vectors v_i (and corresponding coefficients a_i) that minimize the loss function, but somehow we are able to divide this problem into finding each v_i separately. What stops you from getting the same v_i from each of these minimizations? And even if orthogonality is assumed, how is it enforced?
