QuickTopic (SM) free message boards
Topic: CSE 254 Spring 2007
All messages            4-19 of 19  1-3 >>
Simone  4
04-23-2007 03:28 AM ET (US)
Could you clarify the log score explanation? Wouldn't sum_{over all data} log P(Class|Data) give a negative number since all the summands are logs of probabilities? How does this translate into the small positive numbers in the paper?

Thanks.
Panagiotis  5
04-23-2007 11:38 AM ET (US)
After some experimentation: the log score is (sum_{samples} abs(log P(Class|Data))) / #samples.
Charles Elkan  6
04-23-2007 12:55 PM ET (US)
Yes, the log score is the average negative log probability of the true class:

(1/n) SUM_{i=1 to i=n} -log p(true class|example i)

The minus sign is there just to make the log score positive. Confusingly, we then need to minimize the log score, in order to be equivalent to maximizing the log likelihood.
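The formula above can be sketched in a few lines of Python. This is only an illustration: the function name `log_score` and the probabilities in `probs` are made up for the example and do not come from the assignment or the paper.

```python
import math

def log_score(true_class_probs):
    """Average negative log probability assigned to the true class.

    true_class_probs[i] is p(true class | example i), each in (0, 1].
    """
    n = len(true_class_probs)
    return sum(-math.log(p) for p in true_class_probs) / n

# Hypothetical predicted probabilities of the true class for three examples.
probs = [0.9, 0.8, 0.5]
score = log_score(probs)
print(score)  # positive, because each log(p) is <= 0 and the sign is flipped
```

Each summand -log p is nonnegative, so the score is positive and shrinks toward zero as the model puts more probability on the true classes.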
Nick  7
04-23-2007 07:18 PM ET (US)
With regard to the log score, when l2-norm regularization is added, there is an extra term in our objective function that we are minimizing (i.e. -log probability + lambda * sum weights.^2).

When we have l2 regularization should we report the actual objective function (including regularization terms) or the 'pseudo score', pretending we are not including a regularization term?
Charles Elkan  8
04-23-2007 08:06 PM ET (US)
On the training set, I don't think it matters much whether or not you include the extra term when you compute log-score. On the test set, you should not include the extra term, so that you can compare results directly between different methods.


Nick  9
04-23-2007 10:28 PM ET (US)
Regarding training, it seems to help the optimization functions to have a function value that actually reflects the gradients.
Charles Elkan  10
04-23-2007 11:34 PM ET (US)
Can you explain more what you mean? Thanks, Charles



Nick  11
04-23-2007 11:47 PM ET (US)
When using lbfgs, if the gradients (including the weight regularization term) don't match the objective function, lbfgs seems to get confused (iterations start taking much longer than normal) and eventually it quits early even though the objective function seems to still be going down. This is fixed when the objective function is commensurate with the gradients. Each iteration takes about the same amount of time, the process runs for many more iterations, and the objective function gets to the point where it goes down by tiny increments.
Charles Elkan  12
04-23-2007 11:53 PM ET (US)
I'm sorry, I don't understand what you mean by "don't match" and "commensurate." The behavior you describe sounds reasonable if the gradient function is incorrect. In this case the BFGS software will be confused and behave in an undefined way. It is unfortunately very easy to write an incorrect gradient function, if you make a mistake in calculating the partial derivatives symbolically by hand, for example.
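One common way to catch an incorrect gradient function before handing it to a BFGS routine is to compare it against central finite differences. The sketch below is an illustration only: the helper names and the toy objective `f` are hypothetical, not part of the course code.

```python
def numerical_gradient(f, w, eps=1e-6):
    """Central-difference approximation to the gradient of f at w."""
    grad = []
    for j in range(len(w)):
        w_plus = list(w)
        w_minus = list(w)
        w_plus[j] += eps
        w_minus[j] -= eps
        grad.append((f(w_plus) - f(w_minus)) / (2 * eps))
    return grad

def max_gradient_error(f, grad_f, w, eps=1e-6):
    """Largest absolute gap between the analytic and numerical gradients."""
    numeric = numerical_gradient(f, w, eps)
    analytic = grad_f(w)
    return max(abs(a - b) for a, b in zip(analytic, numeric))

# Toy objective with a correct and a deliberately wrong gradient.
def f(w):
    return w[0] ** 2 + 3 * w[0] * w[1]

def good_grad(w):
    return [2 * w[0] + 3 * w[1], 3 * w[0]]

def bad_grad(w):
    return [2 * w[0], 3 * w[0]]  # the 3*w[1] term was forgotten

w0 = [1.0, 2.0]
good_error = max_gradient_error(f, good_grad, w0)
bad_error = max_gradient_error(f, bad_grad, w0)
print(good_error, bad_error)
```

A tiny error for the correct gradient and a large one for the wrong gradient is the pattern to look for; running the check at a few random points helps catch mistakes that happen to vanish at one particular point.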

Nick  13
04-24-2007 01:32 AM ET (US)
When adding l2 regularization, the cost function is changed to include a cost in the parameters. The derivatives change as well.

For optimization packages you must supply a cost function and a derivative function.

Assuming you had no l2 cost and you wanted to add it to an existing function, if you only change the derivatives, the cost function will no longer have the gradients that you assert it does. The optimization algorithm may become confused.

Therefore you must also add the appropriate term to the cost function.

Thus, for an assignment such as this, you need two cost functions - one which includes l2 regularization and one which doesn't. The first is the thing you are actually trying to optimize. The second is the thing you are using as a point of comparison.

Hopefully this is clear and brings the discussion back to the initial question.
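The two cost functions described above might look like the following for binary logistic regression. This is a minimal sketch under assumptions: the model, the function names, the toy data, and the choice to add lambda * sum(w^2) to the averaged log score are all illustrative rather than taken from the assignment.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def predict(w, x):
    """p(class = 1 | x) under a linear logistic model."""
    return sigmoid(sum(w_j * x_j for w_j, x_j in zip(w, x)))

def log_score(w, X, y):
    """Unregularized log score: the quantity to report and compare."""
    total = 0.0
    for x_i, y_i in zip(X, y):
        p = predict(w, x_i)
        p_true = p if y_i == 1 else 1.0 - p
        total += -math.log(p_true)
    return total / len(X)

def objective(w, X, y, lam):
    """Regularized objective: the quantity given to the optimizer."""
    return log_score(w, X, y) + lam * sum(w_j ** 2 for w_j in w)

def objective_gradient(w, X, y, lam):
    """Gradient of objective(); it includes the 2*lam*w penalty term."""
    n = len(X)
    grad = [2.0 * lam * w_j for w_j in w]
    for x_i, y_i in zip(X, y):
        p = predict(w, x_i)
        for j, x_j in enumerate(x_i):
            grad[j] += (p - y_i) * x_j / n
    return grad

# Hypothetical toy data: first feature is a constant bias input.
X = [[1.0, 0.5], [1.0, -1.5], [1.0, 2.0]]
y = [1, 0, 1]
w = [0.1, -0.2]
lam = 0.5
print(objective(w, X, y, lam), log_score(w, X, y))
```

Here `objective` and `objective_gradient` would be passed to the optimizer as a pair, while `log_score` is evaluated separately (for example on the test set) so that results stay comparable across methods.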
Charles Elkan  14
04-24-2007 10:38 AM ET (US)
Nick, thanks for clarifying. Everything you wrote is correct. In other words, if you use regularization, you have to change the objective function and the gradient function. Changing just one of these functions would be a mistake.

Yes also: Generally the CLL (with or without regularization) is not exactly what we really want to optimize from an end-user perspective. For example, the "end-user" objective might be 0/1 accuracy on test data.
 
Messages 15-19 deleted by topic administrator 07-23-2009 12:59 PM