Topic: CSE 291 Assignment 5, Winter 2005
Charles Elkan  14
03-10-2005 07:53 PM ET (US)
/m12 answer: What's important is the yhat predictions for test examples. I think these will be the same whether you scale the test data, or you rescale back the training data, because the scaling is a linear transformation.

If you rescale back the training data, there is a simple formula for rescaling the coefficients b.
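
For reference, the formula can be sketched as follows, assuming each predictor was standardized as z_j = (x_j - mean_j) / sd_j on the training data and y was left unscaled. The function names and numbers are hypothetical, chosen only for illustration.

```python
def unscale_coefficients(a0, a, means, sds):
    """Convert coefficients fitted on standardized predictors
    z_j = (x_j - means[j]) / sds[j] into coefficients on the original x_j."""
    b = [aj / sj for aj, sj in zip(a, sds)]
    b0 = a0 - sum(bj * mj for bj, mj in zip(b, means))
    return b0, b


def predict(b0, b, x):
    """Linear prediction b0 + SUM_j b_j * x_j."""
    return b0 + sum(bj * xj for bj, xj in zip(b, x))


# Illustrative numbers only (not from the assignment data).
means, sds = [10.0, -2.0], [4.0, 0.5]   # training means and std devs
a0, a = 3.0, [2.0, -1.0]                # coefficients on the scaled data
b0, b = unscale_coefficients(a0, a, means, sds)

x_test = [12.0, -1.0]
z_test = [(xj - mj) / sj for xj, mj, sj in zip(x_test, means, sds)]
yhat_scaled = predict(a0, a, z_test)    # scale the test point, use a
yhat_original = predict(b0, b, x_test)  # leave the test point alone, use b
```

Both routes give the same yhat for the test point, provided the test data is scaled with the training means and standard deviations rather than its own.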
Charles Elkan  13
03-10-2005 07:50 PM ET (US)
/m10 answer: I don't think you can say that any value for lambda is intrinsically large or small.

What's more meaningful is how large SUM (y - b0 - SUM xj*bj)^2 is relative to lambda*SUM_j>=1 bj^2.

If the latter is smaller than the former, that says the ridge regression is not changing the answer much. This is what I would expect if all the predictors are useful, and the sample size is large.
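
A minimal sketch of this comparison, computing the two terms of the ridge objective separately. The function name and the data are made up for illustration.

```python
def ridge_objective_terms(X, y, b0, b, lam):
    """Return (rss, penalty) for the ridge objective
    SUM (y - b0 - SUM xj*bj)^2  +  lam * SUM_{j>=1} bj^2."""
    rss = 0.0
    for row, yi in zip(X, y):
        yhat = b0 + sum(xj * bj for xj, bj in zip(row, b))
        rss += (yi - yhat) ** 2
    penalty = lam * sum(bj ** 2 for bj in b)  # intercept b0 is not penalized
    return rss, penalty


# Made-up numbers, for illustration only.
X = [[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]]
y = [3.0, 1.0, -1.0]
rss, penalty = ridge_objective_terms(X, y, b0=1.0, b=[1.0, 0.5], lam=2.0)
ratio = penalty / rss  # well below 1 means the penalty has little effect
```

Looking at the ratio for the fitted b, rather than at lambda alone, makes values like 1000 or 1600 easier to judge.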
Stephen Krotosky  12
03-09-2005 06:41 PM ET (US)
Also for Problem 1,

Once we generate the best set b, do we rescale back to the original means and variances, or do we scale our new test data using the same scaling factors that we used for our training data? The test data won't have exactly zero mean and unit variance, but if the training and test data each represent the underlying distribution well, it seems like the error would be small. Also, if we rescale back, how does this affect our b values?

Thanks
Stephen Krotosky  11
03-09-2005 06:39 PM ET (US)
Question for Problem 2:

I've figured out how to generate the other distributions, but what is an example of a distribution with heavy tails and finite variance?
Stephen Krotosky  10
03-09-2005 06:38 PM ET (US)
Edited by author 03-09-2005 06:38 PM
Some more questions on problem 1:

After doing forward selection with Sidak, I find 7 significant X parameters. I then scale and shift those values to give zero-mean and unit-variance. Next, I try to perform ridge regression using 10-fold cross validation to find the best lambda.

I get an MSE of about 2700 and optimal lambda values roughly between 1000 and 1600, depending on how I randomly permute my X data into 10 sections. Do these seem like reasonable values? I was expecting lambda to be considerably smaller.

Also, how do I resolve the problem that I get such a wide range of lambdas depending on how I divide up the data? I suppose I could do repeated attempts and take an average, but that seems computationally expensive, since I would have to range over hundreds of possible lambdas.

Thanks
Charles Elkan  9
03-08-2005 03:09 PM ET (US)
/m7 answer: Jan's suggestions in /m8 are good.

MSE is like RSS, but RSS is computed on the training set while MSE is computed on a test set (possibly by cross-validation).

With Sidak, it is reasonable to let n be the number of features considered, each time you add one feature. This neglects the fact that you are doing essentially O(n^2) tests since you add one feature n times, but heuristically doing Sidak with n^2 instead of n might be just too strict. In any case, your tests are not independent, especially not different tests of adding the same variable.

To deal with the fact that a given threshold like 0.05 might be too strict, or using n instead of n^2 might be too lenient, trying different thresholds like you try different lambdas is a good idea.
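
As a sketch of how much the choice between n and n^2 matters, the Sidak per-test threshold 1 - (1 - a)^{1/n} from /m7 can be computed directly. The number of features below is hypothetical.

```python
def sidak_threshold(alpha, n):
    """Per-test significance level such that n independent tests have
    family-wise error rate alpha: 1 - (1 - alpha)^(1/n)."""
    return 1.0 - (1.0 - alpha) ** (1.0 / n)


alpha = 0.05
n_features = 10  # hypothetical number of candidate features
per_step = sidak_threshold(alpha, n_features)       # treat it as n tests
stricter = sidak_threshold(alpha, n_features ** 2)  # treat it as n^2 tests
bonferroni = alpha / n_features                     # for comparison only
```

With these values the n-test threshold is roughly 0.005 and the n^2-test threshold roughly ten times smaller, which is why it is worth trying several thresholds instead of fixing one.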
Jan Schellenberger  8
03-08-2005 02:08 PM ET (US)
/m7 You don't have to combine your b's while using ridge regression. You can just use the lambda value from the cross validation and perform one final calculation of b based on all the data.

mean((y-yhat)^2) seems like a good indicator of error, so you can compare across different datasets/methods.

The x-axis can be lambda in the case of the ridge regression. For feature selection I was thinking of letting it be different cutoff values. .05 just seems arbitrary.
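
The procedure described here (choose lambda by cross-validation, then do one final fit on all the data) can be sketched in plain Python as below. This is not the toolbox "ridge" function: all function names are hypothetical, the data is synthetic, and the intercept is handled by centering so that it is not penalized.

```python
import random


def solve(A, c):
    """Solve A w = c by Gaussian elimination with partial pivoting."""
    n = len(c)
    M = [row[:] + [ci] for row, ci in zip(A, c)]
    for k in range(n):
        p = max(range(k, n), key=lambda r: abs(M[r][k]))
        M[k], M[p] = M[p], M[k]
        for r in range(k + 1, n):
            f = M[r][k] / M[k][k]
            for j in range(k, n + 1):
                M[r][j] -= f * M[k][j]
    w = [0.0] * n
    for k in reversed(range(n)):
        s = sum(M[k][j] * w[j] for j in range(k + 1, n))
        w[k] = (M[k][n] - s) / M[k][k]
    return w


def ridge_fit(X, y, lam):
    """Ridge fit with an unpenalized intercept: center X and y, solve
    (Xc'Xc + lam*I) b = Xc'yc, then b0 = ybar - SUM_j b_j * xbar_j."""
    n, p = len(X), len(X[0])
    xbar = [sum(row[j] for row in X) / n for j in range(p)]
    ybar = sum(y) / n
    Xc = [[row[j] - xbar[j] for j in range(p)] for row in X]
    yc = [yi - ybar for yi in y]
    A = [[sum(r[i] * r[j] for r in Xc) + (lam if i == j else 0.0)
          for j in range(p)] for i in range(p)]
    c = [sum(r[i] * yi for r, yi in zip(Xc, yc)) for i in range(p)]
    b = solve(A, c)
    b0 = ybar - sum(bj * mj for bj, mj in zip(b, xbar))
    return b0, b


def mse(X, y, b0, b):
    """mean((y - yhat)^2) on whatever data is passed in."""
    errs = [(yi - b0 - sum(xj * bj for xj, bj in zip(row, b))) ** 2
            for row, yi in zip(X, y)]
    return sum(errs) / len(errs)


def cv_mse(X, y, lam, k, seed=0):
    """Average held-out MSE over k folds for one value of lambda."""
    idx = list(range(len(X)))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]
    total = 0.0
    for fold in folds:
        held = set(fold)
        Xtr = [X[i] for i in idx if i not in held]
        ytr = [y[i] for i in idx if i not in held]
        b0, b = ridge_fit(Xtr, ytr, lam)
        total += mse([X[i] for i in fold], [y[i] for i in fold], b0, b)
    return total / k


# Synthetic data, for illustration only: y = 2*x1 - x2 + noise.
rng = random.Random(1)
X = [[rng.gauss(0, 1), rng.gauss(0, 1)] for _ in range(60)]
y = [2 * r[0] - r[1] + rng.gauss(0, 0.5) for r in X]

lambdas = [0.0, 0.1, 1.0, 10.0, 100.0]
scores = {lam: cv_mse(X, y, lam, k=10) for lam in lambdas}
best_lam = min(scores, key=scores.get)
b0, b = ridge_fit(X, y, best_lam)  # one final fit on all the data
```

The scores dictionary is what would be plotted with lambda on the x-axis. Using the same seed for every lambda keeps the fold assignment fixed, so the curves are comparable across lambdas.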
Stephen Krotosky  7
03-08-2005 01:58 PM ET (US)
Some more Problem 1 questions:

My scheme is to do forward selection using the Sidak procedure and then do ridge regression on those selected features.

For Sidak, each time we try to add another feature, our Sidak threshold is 1 - (1 - a)^{1/n}. Is n the number of remaining features to try?
For ridge regression, when we do cross-validation, I have a couple of questions:
1.) When we do, say, 10-fold cross-validation, we get 10 different b vectors. Is there a best way to combine them?
2.) Also, when we compute MSE, what exactly should we compute? Is it simply (y-yhat)^2, or do we compute something else? This seems just like RSS.

Also, for the error bars, and MSE plots, could you explain what should be the x-axis of these plots?

Thanks
Jan Schellenberger  6
03-08-2005 03:03 AM ET (US)
for P1: I understand how forward feature selection works. I haven't implemented backwards feature selection but I understand it.

How do you implement a mixed adding/removing scheme that isn't 100% greedy? It seems like if you have 5 features, you would want to either add another feature that gives a high F-value (low alpha) or remove a feature such that the difference in models has a low F-value (high alpha). I don't see how you can necessarily decide which one is better.
Charles Elkan  5
03-05-2005 01:38 AM ET (US)
/m3 answer: I am not familiar with these particular MATLAB functions. You may use them, but only if you understand exactly what they do. You may find it easier and more educational to write your own code--this does not have to be lengthy.
Charles Elkan  4
03-05-2005 01:36 AM ET (US)
/m2 answer: Replacing each missing value by the mean of its column is a reasonable simple approach. Another option is to leave out columns and/or rows with too many missing values.

Dealing well with missing values is a whole research area in its own right.
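
A minimal sketch of the column-mean approach, assuming missing entries are marked as None or NaN and that every column has at least one observed value. The function name and data are made up.

```python
import math


def impute_column_means(rows):
    """Replace each missing entry (None or NaN) by the mean of the
    observed values in its column. Assumes every column has at least
    one observed value."""
    def missing(v):
        return v is None or (isinstance(v, float) and math.isnan(v))

    n_cols = len(rows[0])
    means = []
    for j in range(n_cols):
        observed = [r[j] for r in rows if not missing(r[j])]
        means.append(sum(observed) / len(observed))
    return [[means[j] if missing(r[j]) else r[j] for j in range(n_cols)]
            for r in rows]


# Made-up data; None and NaN both mark a missing value.
data = [[1.0, 10.0],
        [None, 20.0],
        [3.0, float("nan")]]
filled = impute_column_means(data)
```

The column means are computed from observed values only, so a missing entry never pulls the mean toward 0.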
Stephen Krotosky  3
03-04-2005 08:06 PM ET (US)
I have a MATLAB question on Problem 1

Is it acceptable to use "ridge" and "stepwisefit" from the MATLAB stat toolbox? If so, I have questions on how to get the b_0 intercept. We don't seem to include the column of 1's in the X matrix, yet the b0 value isn't returned. I've also tried including it, but it never gets selected as a feature, which doesn't seem right. b0 should likely be non-zero.

Any ideas/help on usage of the two functions??

Thanks
Ryan Kelley  2
03-04-2005 07:52 PM ET (US)
On question 1, what is a reasonable method for dealing with missing values? It seems like a bad idea to leave these as 0. Should we remap these values to the mean for the column?
Charles Elkan  1
03-01-2005 02:59 PM ET (US)
Please ask questions here about the fifth (and last!) assignment for 291.