| Who | When |
Messages | |
|
|
|
| |
Messages 41-35 deleted by topic administrator between 07-22-2008 05:11 AM and 02-25-2008 11:11 AM |
| |
Messages 33-29 deleted by topic administrator between 07-23-2006 02:07 AM and 07-21-2006 09:01 AM |
Charles Elkan
|
28
|
 |
|
03-17-2005 12:40 AM ET (US)
|
|
/m26 answer: For 2(a) we are interested in the variance of the bootstrap-based estimate of the median, compared to the variance of (for example) the sample mean. Each of these variances is a function of the sample size N, so the efficiency is a function of N. For 1, I mean give a confidence interval for the forecast MSE of your model, so that you can check whether the true MSE falls within this interval.
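A minimal sketch of the comparison described for 2(a), assuming a Gaussian sample and a plain nonparametric bootstrap. The function names, the choice of distribution, and the number of resamples are illustrative assumptions, not part of the assignment.

```python
import random
import statistics

def bootstrap_variance(sample, estimator, n_boot=1000, rng=None):
    """Bootstrap estimate of the variance of estimator(sample)."""
    rng = rng or random.Random(0)
    n = len(sample)
    replicates = []
    for _ in range(n_boot):
        # Resample n points with replacement from the observed sample.
        resample = [sample[rng.randrange(n)] for _ in range(n)]
        replicates.append(estimator(resample))
    return statistics.variance(replicates)

def relative_efficiency(n, n_boot=1000, seed=0):
    """Var(mean) / Var(median), both bootstrapped from one Gaussian sample of size n."""
    rng = random.Random(seed)
    sample = [rng.gauss(0.0, 1.0) for _ in range(n)]
    var_median = bootstrap_variance(sample, statistics.median, n_boot, rng)
    var_mean = bootstrap_variance(sample, statistics.mean, n_boot, rng)
    return var_mean / var_median

if __name__ == "__main__":
    # Efficiency as a function of the sample size N.
    for n in (10, 40, 160):
        print(n, round(relative_efficiency(n), 3))
```

Each call uses a single simulated sample, so the ratio is noisy; averaging over several seeds per N would give a smoother curve.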
|
Charles Elkan
|
27
|
 |
|
03-17-2005 12:33 AM ET (US)
|
|
/m25 reply: I think Hyun Min Kang is right that the Wilcoxon rank-sum test is not a test for whether or not two medians are the same in general. It is a test of H0: two distributions have the same shape and the same median versus H1: the two distributions have the same shape but different medians.
|
| samory
|
26
|
 |
|
03-16-2005 12:41 AM ET (US)
|
|
About Message 18. If you want us to estimate the distribution of the median in Pb 2, do you then just want us to do this for a few different sample sizes? (You said a range of sample sizes, which seems to mean that you would like to see the behavior over different sizes, but if we have to estimate the actual distribution for each sample size, it'll be hard to report as a function over a range...) An alternative would be to estimate the confidence intervals obtained for the median, and report this over a range of sample sizes. Would this be fine?
About Pb 1, what do you mean by "error bars" for giving a forecast of the MSE of our final model ?
|
| Hyun Min Kang
|
25
|
 |
|
03-15-2005 08:12 PM ET (US)
|
|
Edited by author 03-15-2005 08:13 PM
/m23 As far as I know, the Wilcoxon rank-sum test is a nonparametric test to see if two distributions are independent or not (like the t-test). I think it's quite different from testing if two medians are the same. For example, if a = [0 0 0 0 0 1 2 3 4], b = [-4 -3 -2 -1 0 0 0 0 0], then the rank-sum test would report some 'significant' result, but actually their medians are the same. Shouldn't we use a different test?
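The example above can be checked numerically. This is a sketch that uses the normal approximation to the rank-sum statistic with midranks and a tie correction; with only 9 points per group the approximation is rough, so the p-value is indicative only. The helper names are illustrative.

```python
import math
import statistics
from collections import Counter

def midranks(values):
    """Map each distinct value to its average 1-based rank in the pooled sample."""
    ordered = sorted(values)
    ranks = {}
    i = 0
    while i < len(ordered):
        j = i
        while j < len(ordered) and ordered[j] == ordered[i]:
            j += 1
        # 0-based positions i..j-1 hold the same value; average ranks i+1..j.
        ranks[ordered[i]] = (i + 1 + j) / 2
        i = j
    return ranks

def rank_sum_test(a, b):
    """Return (rank sum of a, z score, two-sided p) under the normal approximation."""
    pooled = list(a) + list(b)
    n1, n2, n = len(a), len(b), len(pooled)
    ranks = midranks(pooled)
    w = sum(ranks[x] for x in a)
    mean_w = n1 * (n + 1) / 2
    # Tie correction: each group of t tied values reduces the variance.
    tie_term = sum(t**3 - t for t in Counter(pooled).values()) / (n * (n - 1))
    var_w = n1 * n2 / 12 * ((n + 1) - tie_term)
    z = (w - mean_w) / math.sqrt(var_w)
    p = math.erfc(abs(z) / math.sqrt(2))
    return w, z, p

a = [0, 0, 0, 0, 0, 1, 2, 3, 4]
b = [-4, -3, -2, -1, 0, 0, 0, 0, 0]

if __name__ == "__main__":
    print("medians:", statistics.median(a), statistics.median(b))
    print("rank sum, z, p:", rank_sum_test(a, b))
```

Both medians are 0, while the rank sum of a is 113.5 against a null expectation of 85.5, so the test does flag a difference even though the medians agree.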
|
Charles Elkan
|
24
|
 |
|
03-15-2005 06:00 PM ET (US)
|
|
/m18 answer: You are right, I mean use the bootstrap method to investigate the distribution of the sample median.
|
Charles Elkan
|
23
|
 |
|
03-15-2005 05:55 PM ET (US)
|
|
|
Charles Elkan
|
22
|
 |
|
03-15-2005 05:50 PM ET (US)
|
|
/m16, /m19: Thanks for the explanation, Jan.
|
Charles Elkan
|
21
|
 |
|
03-15-2005 05:49 PM ET (US)
|
|
/m20 answer: The Laplace has fatter tails than the Gaussian because the probability density of x decays as exp(-|x|) for the Laplace, as opposed to exp(-x^2) for the Gaussian.
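A short sketch comparing two-sided tail probabilities at equal variance, assuming both distributions are centered at 0 and scaled to variance 1 (for the Laplace this means scale b = 1/sqrt(2), since its variance is 2b^2).

```python
import math

def gaussian_tail(t):
    """P(|X| > t) for a Gaussian with mean 0 and variance 1."""
    return math.erfc(t / math.sqrt(2))

def laplace_tail(t):
    """P(|X| > t) for a Laplace with mean 0 and variance 1 (scale b = 1/sqrt(2))."""
    b = 1 / math.sqrt(2)
    return math.exp(-t / b)

if __name__ == "__main__":
    for t in (1, 2, 3, 4, 5):
        print(t, gaussian_tail(t), laplace_tail(t))
```

Near the center the Gaussian holds more mass beyond t (about 0.317 versus 0.243 at t = 1), which may be why the Laplace can look skinnier on a plot; further out the ordering reverses (about 0.003 versus 0.014 at t = 3), and the gap keeps growing.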
|
| Taylor Sittler
|
20
|
 |
|
03-14-2005 02:21 PM ET (US)
|
|
Re: Laplace
The Laplace (double-exponential) distribution seems to have tails that are skinnier than the Gaussian. Is there a way to shape it such that it has heavy tails (ideally with variance=1)?
|
| Jan Schellenberger
|
19
|
 |
|
03-14-2005 02:56 AM ET (US)
|
|
/m16 The scaling is important to give each feature an equal chance at contributing. Let's say feature 1 is an excellent predictor of the output, however, the variance of feature 1 is tiny. Then in order to fit a good model, the b coefficient of this feature will have to be huge. However, in ridge regression we are trying to also shrink b as we fit the model, so a ridge fitted model may ignore feature 1 in favor of other features which have a bigger variance even though they are worse predictors. Normalizing each feature eliminates this problem by making the 'average' b for each feature about the same. -Jan
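A sketch of this effect for the simplest case, one centered feature, where the ridge slope is sum(x*y) / (sum(x*x) + lambda). The data are hypothetical: a perfect predictor with a tiny spread, fitted once raw and once standardized.

```python
def ridge_slope(x, y, lam):
    """Ridge coefficient for one feature after centering x and y."""
    mx = sum(x) / len(x)
    my = sum(y) / len(y)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    sxx = sum((xi - mx) ** 2 for xi in x)
    return sxy / (sxx + lam)

def retained_fraction(x, y, lam):
    """Ridge slope divided by the least-squares slope (lam = 0)."""
    return ridge_slope(x, y, lam) / ridge_slope(x, y, 0.0)

# Hypothetical feature with tiny variance that predicts y exactly.
x_small = [0.001 * i for i in range(10)]
y = [1000.0 * xi for xi in x_small]

# The same feature shifted and scaled to zero mean and unit sample variance.
mean = sum(x_small) / len(x_small)
sd = (sum((xi - mean) ** 2 for xi in x_small) / (len(x_small) - 1)) ** 0.5
x_std = [(xi - mean) / sd for xi in x_small]

if __name__ == "__main__":
    print("raw:", retained_fraction(x_small, y, 1.0))
    print("standardized:", retained_fraction(x_std, y, 1.0))
```

With lambda = 1 the raw feature keeps only a tiny fraction of its least-squares coefficient, while the standardized feature keeps 9/10 of it, because the fraction is sum(x*x) / (sum(x*x) + lambda).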
|
| Jan Schellenberger
|
18
|
 |
|
03-14-2005 02:34 AM ET (US)
|
|
For Problem 2a)
What does it mean to estimate the median of a distribution using bootstrapping? I can see how you can estimate the median from a sample. I don't see how bootstrapping helps. Bootstrapping may be useful to figure out the distribution of the median. Is that the question?
-Jan
|
| Banu Dost
|
17
|
 |
|
03-12-2005 03:48 AM ET (US)
|
|
For problem 2 part b, is using the absolute value of the difference between two medians as our test statistic a good idea? Or should it be something more complicated?
Banu
|
| Banu Dost
|
16
|
 |
|
03-12-2005 03:43 AM ET (US)
|
|
In ridge regression, I do not see the point of standardizing the data by shifting and scaling. If we shift the data, the b values do not change, except b0. If we scale a column by 1/std, then isn't its b value just scaled by the std of that column? But we still have the same predicted y vector. So, what do we gain by standardizing?
|
Charles Elkan
|
15
|
 |
|
03-10-2005 07:59 PM ET (US)
|
|
/m11 answer: The Laplace distribution has finite variance, and tails that are heavier than the Gaussian's. The Pareto distribution has even heavier tails. See pages 623 and 625 of Casella and Berger.
|
Charles Elkan
|
14
|
 |
|
03-10-2005 07:53 PM ET (US)
|
|
/m12 answer: What's important is the yhat predictions for test examples. I think these will be the same whether you scale the test data, or you rescale back the training data, because the scaling is a linear transformation. If you rescale back the training data, there is a simple formula for rescaling the coefficients b.
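One way to write the rescaling is sketched below, assuming each feature was standardized as z_j = (x_j - m_j) / s_j using the training means m_j and standard deviations s_j. The function names and the numbers in the usage lines are illustrative.

```python
def to_original_scale(b0_std, b_std, means, sds):
    """Convert coefficients fitted on z_j = (x_j - means[j]) / sds[j] to raw-x coefficients."""
    b = [bj / s for bj, s in zip(b_std, sds)]
    # The intercept absorbs the shift: b0 = b0_std - sum_j b_j * m_j.
    b0 = b0_std - sum(bj * m for bj, m in zip(b, means))
    return b0, b

def predict(b0, b, row):
    """Linear prediction b0 + sum_j b_j * x_j."""
    return b0 + sum(bj * xj for bj, xj in zip(b, row))

# Hypothetical training statistics and standardized-scale coefficients.
means, sds = [2.0, -1.0], [0.5, 4.0]
b0_std, b_std = 3.0, [1.5, -2.0]

raw_row = [2.7, 3.0]
std_row = [(x - m) / s for x, m, s in zip(raw_row, means, sds)]
b0_raw, b_raw = to_original_scale(b0_std, b_std, means, sds)

if __name__ == "__main__":
    print(predict(b0_std, b_std, std_row), predict(b0_raw, b_raw, raw_row))
```

Both routes give the same yhat for a test example, provided the test data are scaled with the training means and standard deviations rather than their own.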
|
Charles Elkan
|
13
|
 |
|
03-10-2005 07:50 PM ET (US)
|
|
/m10 answer: I don't think you can say that any value for lambda is intrinsically large or small. What's more meaningful is how large SUM (y - b0 - SUM xj*bj)^2 is relative to lambda*SUM_j>=1 bj^2. If the latter is smaller than the former, that says the ridge regression is not changing the answer much. This is what I would expect if all the predictors are useful, and the sample size is large.
|
| Stephen Krotosky
|
12
|
 |
|
03-09-2005 06:41 PM ET (US)
|
|
Also for Problem 1,
Once we generate the best set b, do we rescale back to the original means and variances, or do we scale our new test data using the same scaling factors that we used to scale our training data? The test data won't exactly have zero mean and unit variance, but if the training and test data each represent the underlying distributions well, it seems like the error would be small? Also, if we rescale back, how does this affect our b values?
Thanks
|
| Stephen Krotosky
|
11
|
 |
|
03-09-2005 06:39 PM ET (US)
|
|
Question for Problem 2:
I've figured out how to generate the other distributions, but what is an example of a distribution with heavy tails and finite variance?
|
| Stephen Krotosky
|
10
|
 |
|
03-09-2005 06:38 PM ET (US)
|
|
Edited by author 03-09-2005 06:38 PM
Some more questions on problem 1:
After doing forward selection with Sidak, I find 7 significant X parameters. I then scale and shift those values to give zero-mean and unit-variance. Next, I try to perform ridge regression using 10-fold cross validation to find the best lambda.
My question is that I get MSE on the order of about 2700 and I get optimal lambda values between 1000 and 1600, roughly, depending on how I randomly permute my X data into 10 sections. Does this seem like a reasonable value? I was thinking that lambda would be of considerably smaller magnitude.
Also, how do I resolve the problem that I have such a wide range of lambdas depending on how I divide up the data? I suppose I could do repeated attempts and take an average, but that seems computationally expensive, since I would have to range over hundreds of possible lambdas.
Thanks
|
Charles Elkan
|
9
|
 |
|
03-08-2005 03:09 PM ET (US)
|
|
/m7 answer: Jan's suggestions in /m8 are good. MSE is like RSS, but RSS is computed on the training set while MSE is computed on a test set (possibly by cross-validation). With Sidak, it is reasonable to let n be the number of features considered, each time you add one feature. This neglects the fact that you are doing essentially O(n^2) tests since you add one feature n times, but heuristically doing Sidak with n^2 instead of n might be just too strict. In any case, your tests are not independent, especially not different tests of adding the same variable. To deal with the fact that a given threshold like 0.05 might be too strict, or using n instead of n^2 might be too lenient, trying different thresholds like you try different lambdas is a good idea.
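For reference, a small sketch of the Sidak per-test threshold from /m7, 1 - (1 - a)^(1/n), evaluated with n and with n^2 tests. The function name and the example values are illustrative.

```python
def sidak_threshold(alpha, n_tests):
    """Per-test level that keeps the family-wise level at alpha for n independent tests."""
    return 1 - (1 - alpha) ** (1 / n_tests)

if __name__ == "__main__":
    alpha = 0.05
    for n in (5, 10, 20):
        # Compare counting n tests per step with counting n^2 tests overall.
        print(n, sidak_threshold(alpha, n), sidak_threshold(alpha, n * n))
```

The threshold is slightly more lenient than the Bonferroni value alpha/n, and it assumes independent tests, which does not hold here.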
|
| Jan Schellenberger
|
8
|
 |
|
03-08-2005 02:08 PM ET (US)
|
|
/m7 You don't have to combine your b's while using ridge regression. You can just use the lambda value from the cross validation and perform one final calculation of b based on all the data. mean((y-yhat)^2) seems like a good indicator of error, so you can compare across different datasets/methods. The x-axis can be lambda in the case of the ridge regression. For feature selection I was thinking of letting it be different cutoff values; .05 just seems arbitrary.
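A sketch of that procedure, reduced to a single feature so it stays self-contained: pick lambda by k-fold cross-validated mean((y-yhat)^2), then refit once on all the data. The one-feature ridge formula, the fold assignment, and the function names are simplifying assumptions.

```python
import random

def fit_ridge_1d(x, y, lam):
    """Ridge fit with one feature and an unpenalized intercept; returns (b0, b1)."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    sxx = sum((xi - mx) ** 2 for xi in x)
    b1 = sxy / (sxx + lam)
    return my - b1 * mx, b1

def cv_mse(x, y, lam, k=10, seed=0):
    """Held-out mean squared error for one lambda, pooled over k folds."""
    idx = list(range(len(x)))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]
    total = 0.0
    for held_out in folds:
        held = set(held_out)
        train = [i for i in idx if i not in held]
        b0, b1 = fit_ridge_1d([x[i] for i in train], [y[i] for i in train], lam)
        total += sum((y[i] - b0 - b1 * x[i]) ** 2 for i in held_out)
    return total / len(x)

def choose_and_refit(x, y, lambdas, k=10, seed=0):
    """Pick the lambda with the lowest CV error, then refit on all the data."""
    best = min(lambdas, key=lambda lam: cv_mse(x, y, lam, k, seed))
    return best, fit_ridge_1d(x, y, best)

if __name__ == "__main__":
    rng = random.Random(1)
    x = [float(i) for i in range(30)]
    y = [1.0 + 2.0 * xi + rng.gauss(0.0, 3.0) for xi in x]
    print(choose_and_refit(x, y, [0.0, 1.0, 10.0, 100.0, 1000.0]))
```

Plotting cv_mse against lambda gives the curve with lambda on the x-axis; rerunning with different seed values shows how much the chosen lambda depends on the split.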
|
| Stephen Krotosky
|
7
|
 |
|
03-08-2005 01:58 PM ET (US)
|
|
Some more Problem 1 questions:
My scheme is to do forward selection using the Sidak procedure and then do ridge regression on those selected features.
For Sidak, each time we try to add another feature, our Sidak threshold is 1 - (1 - a)^{1/n}. Is n the number of remaining features to try? For ridge regression, when we do cross-validation, I have a couple of questions: 1.) When we do, say, 10-fold cross-validation, we get 10 different b vectors. Is there a best way to combine them? 2.) Also, when we compute MSE, what exactly should we compute? Is it simply (y-yhat)^2, or do we compute something else, because this seems just like RSS?
Also, for the error bars, and MSE plots, could you explain what should be the x-axis of these plots?
Thanks
|
| Jan Schellenberger
|
6
|
 |
|
03-08-2005 03:03 AM ET (US)
|
|
for P1: I understand how forward feature selection works. I haven't implemented backwards feature selection but I understand it.
How do you implement a mixed adding/removing scheme that isn't 100% greedy? It seems like if you have 5 features, you would want to either add another feature that gives a high F-value (low alpha) or remove a feature such that the difference in models has a low F-value (high alpha). I don't see how you can necessarily decide which one is better.
|
| Charles Elkan
|
5
|
 |
|
03-05-2005 01:38 AM ET (US)
|
|
/m3 answer: I am not familiar with these particular MATLAB functions. You may use them, but only if you understand exactly what they do. You may find it easier and more educational to write your own code; this does not have to be lengthy.
|
| Charles Elkan
|
4
|
 |
|
03-05-2005 01:36 AM ET (US)
|
|
/m2 answer: Replacing each missing value by the mean of its column is a reasonable simple approach. Another option is to leave out columns and/or rows with too many missing values. Dealing well with missing values is a whole research area in its own right.
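A minimal sketch of the column-mean approach, assuming the data are stored as a list of rows with None marking a missing entry. This representation and the function name are assumptions for illustration.

```python
def impute_column_means(rows, missing=None):
    """Return a copy of rows with each missing entry replaced by its column's observed mean."""
    n_cols = len(rows[0])
    means = []
    for j in range(n_cols):
        # Average only the entries that are actually present in column j.
        observed = [r[j] for r in rows if r[j] is not missing]
        means.append(sum(observed) / len(observed))
    return [[means[j] if r[j] is missing else r[j] for j in range(n_cols)] for r in rows]

if __name__ == "__main__":
    data = [[1.0, None], [3.0, 4.0], [None, 6.0]]
    print(impute_column_means(data))
```

A column with no observed values raises ZeroDivisionError here, which is a natural place to apply the other option from the answer and drop that column.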
|
| Stephen Krotosky
|
3
|
 |
|
03-04-2005 08:06 PM ET (US)
|
|
I have a MATLAB question on Problem 1
Is it acceptable to use "ridge" and "stepwisefit" from the MATLAB Statistics Toolbox? If so, I have questions on how to get the b_0 intercept. We don't seem to include the column of 1's in the X matrix, yet the b0 value isn't returned. I've also tried including it, but it never gets selected as a feature, which doesn't seem right. b0 should likely be non-zero.
Any ideas/help on usage of the two functions??
Thanks
|
| Ryan Kelley
|
2
|
 |
|
03-04-2005 07:52 PM ET (US)
|
|
On question 1, what is a reasonable method for dealing with missing values? It seems like a bad idea to leave these as 0. Should we remap these values to the mean for the column?
|
Charles Elkan
|
1
|
 |
|
03-01-2005 02:59 PM ET (US)
|
|
Please ask questions here about the fifth (and last!) assignment for 291.
|