04:35 PM ET (US)
There was a slide about the perturbation near the end (in the applications section). Could you make your point about that slide again?

also some minor questions:
I do not quite understand how the Hessian is computed in practice; the paper says it is an approximation.

Also, wouldn't LM be stuck at local minima most of the time?
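For what it's worth, my understanding is that the approximation the paper refers to is the Gauss-Newton one: for a sum-of-squares error the Hessian is built from first derivatives only, dropping the term involving second derivatives of the model (which is small when the residuals are small). A minimal sketch, with a made-up model f(x; w) = w0 * exp(w1 * x) purely for illustration:

```python
import numpy as np

def gauss_newton_hessian(jacobian):
    """Gauss-Newton approximation: H ~ 2 J^T J, first derivatives only.

    The exact Hessian of E(w) = sum_i (y_i - f(x_i; w))^2 also contains
    a term with second derivatives of f; near a minimum the residuals
    are small, so that term is dropped.
    """
    return 2.0 * jacobian.T @ jacobian

# Toy model: f(x; w) = w0 * exp(w1 * x), evaluated at three data points
x = np.array([0.0, 1.0, 2.0])
w = np.array([1.0, 0.5])
# Jacobian of f w.r.t. w, one row per data point
J = np.column_stack([np.exp(w[1] * x), w[0] * x * np.exp(w[1] * x)])
H = gauss_newton_hessian(J)
print(H.shape)  # (2, 2); symmetric positive semi-definite by construction
```

By construction this approximate Hessian is symmetric and positive semi-definite, which is part of why it is convenient for LM.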
Edited 10-05-2004 08:53 PM
Steve Scher
08:41 AM ET (US)
I'd like to understand a bit better how to choose between the directions indicated by the gradient and by the Hessian. The papers are a bit silent on this, focusing instead on step size, beyond stating that L-M's *10 or /10 rule gives a good mix.

I'm a step behind Gary in message 3 /m3: of course second-order information can be a helpful addition to the gradient, but I don't understand why, in general, we want to move "in the directions in which the gradient is smaller". If we did that completely, we'd be moving orthogonally to the gradient. So it must mean deviating a bit from the gradient, in the direction indicated by the Hessian; but LM favors the modified-gradient direction only where the linearization breaks down. Why is that a good region in which to modify the gradient according to the Hessian?

Regarding Louka's question about what "medium sized" means in message 2 /m2: both papers describe the limit as "hundreds of weights" with "thousands" being prohibitive.
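To make the *10 or /10 rule concrete, here is a minimal sketch of the damping loop as I read it (the toy "error valley" surface and the particular functions below are my own, not from either paper): after each proposed step, lambda shrinks if the error went down (trusting the quadratic model, moving toward the Newton-like step) and grows if it went up (falling back toward small, gradient-like steps).

```python
import numpy as np

def lm_step(grad, H, lam):
    """Solve (H + lam * diag(H)) dw = -grad for the proposed LM update."""
    A = H + lam * np.diag(np.diag(H))
    return np.linalg.solve(A, -grad)

def levenberg_marquardt(E, grad_fn, H_fn, w, lam=1e-3, iters=50):
    for _ in range(iters):
        dw = lm_step(grad_fn(w), H_fn(w), lam)
        if E(w + dw) < E(w):      # step helped: accept it, trust the model more
            w = w + dw
            lam /= 10.0
        else:                     # step hurt: reject it, damp more heavily
            lam *= 10.0
    return w

# Toy "error valley": steep in one direction, shallow in the other
E = lambda w: (w[0] - 3) ** 2 + 10 * (w[1] + 1) ** 2
grad = lambda w: np.array([2 * (w[0] - 3), 20 * (w[1] + 1)])
H = lambda w: np.diag([2.0, 20.0])
w_min = levenberg_marquardt(E, grad, H, np.array([0.0, 0.0]))
```

On this quadratic toy surface the loop converges to the minimum at (3, -1) almost immediately, since the quadratic model is exact; the interesting behavior of the rule only shows up on genuinely non-quadratic surfaces.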
Sanjeev Kumar
05:20 AM ET (US)
On the 4th page of the paper, Roweis says the linear approximation of f(w) is only valid near a minimum. I don't understand why (at least if we interpret "near" in the Euclidean-distance sense). I tried reasoning along the following lines:

   Linear approximation of f(w) is valid in current neighborhood (1)
=> Quadratic approximation of E(w) is valid in current neighborhood (2)
=> we can reach minimum in 1 step (assuming exact validity) (3)
=> we are near minimum (4)

But implication (3) need not be true: it requires the additional condition that the neighborhood in (1) is large enough to contain the minimum point, and implication (4) would then require a different interpretation of "near".

One more (but unrelated) question: there are some methods (e.g. Davidon-Fletcher-Powell) which update the inverse of the Hessian matrix via a secant equation, instead of recomputing it on every iteration, which can be very useful for large problems. Is there an equivalent update for the inverse of (H + \lambda diag(H)) so that it could be used in the Levenberg-Marquardt algorithm?
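For reference, the DFP update I have in mind looks like the following sketch (the example vectors are arbitrary; with s = w_new - w_old and y = grad_new - grad_old, the update keeps the inverse-Hessian estimate symmetric and enforces the secant equation G_new @ y = s):

```python
import numpy as np

def dfp_update(G, s, y):
    """Davidon-Fletcher-Powell secant update of the inverse Hessian.

    G approximates H^{-1}; s = w_new - w_old, y = grad_new - grad_old.
    The rank-two correction keeps G symmetric and satisfies the secant
    equation G_new @ y = s, with no explicit matrix inversion per step.
    """
    Gy = G @ y
    return G + np.outer(s, s) / (s @ y) - np.outer(Gy, Gy) / (y @ Gy)

# Check the secant equation on an arbitrary example
G = np.eye(3)
s = np.array([1.0, 0.5, -0.2])
y = np.array([2.0, 1.0, 0.3])
G_new = dfp_update(G, s, y)
# G_new @ y equals s, as the secant equation requires
```

The difficulty I see with adapting this to LM is that (H + \lambda diag(H)) changes discontinuously whenever \lambda is rescaled, whereas secant methods assume the underlying matrix evolves smoothly with w; I don't know of a standard fix, hence the question.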
Gary Tedeschi
05:16 PM ET (US)
On the last page of the paper Roweis discusses the Marquardt improvement to Levenberg's algorithm. The goal of the improvement is that "we should move further in the directions in which the gradient is smaller in order to get around the classic 'error valley' problem." I understand the goal, but I am not completely clear on how introducing diag(H) achieves it.
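My partial answer to my own question, as a numeric sketch (the matrices here are made up to exaggerate the effect): for large lambda, the step component along direction i is roughly g_i / (lambda * H_ii), so dividing by the curvature H_ii gives proportionally larger steps along low-curvature directions, i.e. along the valley floor, which plain Levenberg damping (adding lambda * I) does not do.

```python
import numpy as np

# Curvatures: steep in w0, shallow valley along w1
H = np.diag([100.0, 1.0])
g = np.array([1.0, 1.0])
lam = 1e3  # strongly damped regime, where the two variants differ most

step_levenberg = np.linalg.solve(H + lam * np.eye(2), g)
step_marquardt = np.linalg.solve(H + lam * np.diag(np.diag(H)), g)

ratio_L = step_levenberg[1] / step_levenberg[0]   # ~1: both directions stepped equally
ratio_M = step_marquardt[1] / step_marquardt[0]   # ~100: shallow direction favored
```

So with diag(H), the heavily damped step is no longer plain gradient descent but a gradient rescaled by inverse curvature, which is exactly "moving further in the directions in which the gradient is smaller" along a valley.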
Louka Dlagnekov
02:36 AM ET (US)
These papers contain quite a mouthful of math!

Both papers claim that LM optimization outperforms gradient descent for medium-sized problems. What exactly does this mean? Would an ADALINE using the LM method instead of steepest descent do better at approximating the function y=4x_1*x_2?

Also, does "medium-sized problems" mean ones with a relatively small number of weights?
Robin Hewitt
04:48 PM ET (US)
These papers are very clear presentations of the LM method...thanks for that! They did leave me with a question, though. How dependent is the success of this method on the quadratic-approximation assumption? Put differently, what happens when that assumption isn't a good one? As a follow-on, is this ever an issue in practice?
