I'd like to understand better how to choose between the direction indicated by the gradient and the one indicated by the Hessian. The papers say little about this and focus on step size, stating only that L-M's *10 or /10 rule gives a good mix.
I'm a step behind Gary in message 3
/m3: Of course second-order information can be a helpful addition to the gradient, but I don't understand why, in general, we would want to move "in the directions in which the gradient is smaller". If we did that completely, we'd be moving orthogonally to the gradient. So it must mean deviating a bit from the gradient, toward the direction pointed to by the Hessian. But L-M favors the modified-gradient direction only where the linearization breaks down. Why is this a good region in which to modify the gradient according to the Hessian?
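To make the question concrete, here is a minimal sketch of how the direction changes with the damping factor, assuming the textbook form of the L-M step, (J^T J + lam*I) d = -J^T r. The toy Jacobian `J`, residual `r`, and the helper names `lm_step` and `cosine` are my own illustrative choices, not anything taken from the papers. In this example the small-damping step moves mostly along the coordinate where the gradient is smaller (which here is also the low-curvature one), while the large-damping step lines up with the negative gradient.

```python
import numpy as np

def lm_step(J, r, lam):
    """Solve (J^T J + lam*I) d = -J^T r for the damped step d."""
    g = J.T @ r                      # gradient of 0.5 * ||r||^2
    H = J.T @ J                      # Gauss-Newton approximation to the Hessian
    return np.linalg.solve(H + lam * np.eye(H.shape[0]), -g)

def cosine(a, b):
    """Cosine of the angle between two vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Hypothetical toy problem: curvature 100 along the first weight, 1 along the second.
J = np.array([[10.0, 0.0],
              [0.0, 1.0]])
r = np.array([1.0, 1.0])
neg_grad = -(J.T @ r)

# Cosine between the L-M step and the steepest-descent direction, per damping value.
cosines = {}
for lam in (1e-3, 1.0, 1e3):
    d = lm_step(J, r, lam)
    cosines[lam] = cosine(d, neg_grad)
    print(f"lam={lam:g}  step={d}  cos(step, -grad)={cosines[lam]:.3f}")
```

In every case the step still has a positive component along the negative gradient, so it remains a descent direction; the damping only controls how far it is rotated toward the curvature-scaled direction.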
Regarding Louka's question about what "medium sized" means in message 2
/m2: Both papers put the limit at "hundreds of weights" and describe "thousands" as prohibitive.