| Who | When |
Messages | |
|
|
|
| Hyun Min Kang | 25 | 03-15-2005 08:12 PM ET (US) |
Edited by author 03-15-2005 08:13 PM
/m23 As far as I know, the Wilcoxon rank-sum test is a nonparametric test of whether two samples come from the same distribution (a nonparametric counterpart of the t-test). I think that is quite different from testing whether two medians are the same. For example, if a = [0 0 0 0 0 1 2 3 4] and b = [-4 -3 -2 -1 0 0 0 0 0], then the rank-sum test would report a 'significant' result, even though the two medians are the same. Shouldn't we use a different test?
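The example above can be checked numerically. The sketch below is not from the thread: it assumes the usual normal approximation to the rank-sum statistic, with midranks and a tie correction but no continuity correction, and the helper name rank_sum_test is hypothetical. It uses only the Python standard library.

```python
import math
from statistics import median

def rank_sum_test(a, b):
    """Two-sided rank-sum test via the normal approximation.

    Uses midranks for ties and the standard tie-corrected variance.
    Returns (z, p_value).
    """
    n1, n2 = len(a), len(b)
    n = n1 + n2
    pooled = sorted(list(a) + list(b))
    midrank = {}
    tie_term = 0.0
    i = 0
    while i < n:
        j = i
        while j < n and pooled[j] == pooled[i]:
            j += 1
        t = j - i                               # size of this tied group
        midrank[pooled[i]] = (i + 1 + j) / 2    # mean of ranks i+1 .. j
        tie_term += t ** 3 - t
        i = j
    w = sum(midrank[x] for x in a)              # rank sum of sample a
    mean_w = n1 * (n + 1) / 2
    var_w = n1 * n2 / 12 * ((n + 1) - tie_term / (n * (n - 1)))
    z = (w - mean_w) / math.sqrt(var_w)
    p = math.erfc(abs(z) / math.sqrt(2))        # two-sided normal tail
    return z, p

a = [0, 0, 0, 0, 0, 1, 2, 3, 4]
b = [-4, -3, -2, -1, 0, 0, 0, 0, 0]
z, p = rank_sum_test(a, b)
print("medians:", median(a), median(b))
print("z =", z, "p =", p)
```

With these two samples the medians agree while the approximate p-value falls below 0.05, which is the mismatch the post describes.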
|
| Charles Elkan | 24 | 03-15-2005 06:00 PM ET (US) |
/m18 answer: You are right. I mean: use the bootstrap method to investigate the distribution of the sample median.
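A minimal sketch of that idea, assuming a plain nonparametric bootstrap (resample the data with replacement, recompute the median each time). The function names, the sample of 50 Gaussian draws, and the 2000 resamples are illustrative assumptions, not part of the assignment.

```python
import random
from statistics import median, stdev

def bootstrap_medians(sample, n_boot=2000, seed=0):
    """Medians of n_boot resamples drawn with replacement from sample."""
    rng = random.Random(seed)
    n = len(sample)
    return [median(rng.choices(sample, k=n)) for _ in range(n_boot)]

def percentile_interval(values, alpha=0.05):
    """Simple percentile interval from the sorted bootstrap values."""
    s = sorted(values)
    lo = s[int((alpha / 2) * len(s))]
    hi = s[int((1 - alpha / 2) * len(s)) - 1]
    return lo, hi

# Hypothetical data: 50 draws from a standard Gaussian.
data_rng = random.Random(1)
sample = [data_rng.gauss(0.0, 1.0) for _ in range(50)]

meds = bootstrap_medians(sample)
lo, hi = percentile_interval(meds)
print("sample median       :", median(sample))
print("bootstrap std. error:", stdev(meds))
print("95% percentile range:", (lo, hi))
```

The spread of meds approximates the sampling variability of the sample median, which is what the bootstrap adds beyond the single point estimate.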
|
| Charles Elkan | 23 | 03-15-2005 05:55 PM ET (US) |
|
| Charles Elkan | 22 | 03-15-2005 05:50 PM ET (US) |
/m16, /m19: Thanks for the explanation, Jan.
|
| Charles Elkan | 21 | 03-15-2005 05:49 PM ET (US) |
/m20 answer: The Laplace has fatter tails than the Gaussian because its density decays as exp(-|x|), as opposed to exp(-x^2) for the Gaussian (up to scale constants).
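A small sketch that compares the two-sided tail probabilities P(|X| > t) when both distributions are scaled to variance 1. It assumes the standard parameterization in which a Laplace with scale b has variance 2*b^2, so unit variance means b = 1/sqrt(2). The function names are hypothetical.

```python
import math

def laplace_tail(t, variance=1.0):
    """P(|X| > t) for a zero-mean Laplace with the given variance."""
    b = math.sqrt(variance / 2.0)   # Laplace variance is 2 * b**2
    return math.exp(-t / b)

def gaussian_tail(t, variance=1.0):
    """P(|X| > t) for a zero-mean Gaussian with the given variance."""
    return math.erfc(t / math.sqrt(2.0 * variance))

for t in (1, 2, 3, 4, 6):
    print(t, laplace_tail(t), gaussian_tail(t))
```

At unit variance the Laplace puts less mass beyond t = 1 than the Gaussian but more mass beyond t = 2 and further out, so it can look skinnier at moderate distances even though its far tails are heavier.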
|
| Taylor Sittler | 20 | 03-14-2005 02:21 PM ET (US) |
Re: Laplace
The Laplace (double-exponential) distribution seems to have tails that are skinnier than the Gaussian's. Is there a way to shape it so that it has heavy tails (ideally with variance = 1)?
|
| Jan Schellenberger | 19 | 03-14-2005 02:56 AM ET (US) |
/m16 The scaling is important to give each feature an equal chance at contributing. Let's say feature 1 is an excellent predictor of the output, but its variance is tiny. Then, in order to fit a good model, the b coefficient of this feature has to be huge. However, in ridge regression we are also trying to shrink b as we fit the model, so a ridge-fitted model may ignore feature 1 in favor of other features that have a bigger variance even though they are worse predictors. Normalizing each feature eliminates this problem by making the 'average' b for each feature about the same. -Jan
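The effect can be seen on synthetic data. This is only a sketch under made-up assumptions: two features, a tiny-variance strong predictor x1 and a large-variance weak predictor x2, lambda = 1, and a hand-written 2x2 ridge solve (the helper ridge_2d is hypothetical and covers just the two-feature case).

```python
import random
from statistics import mean, pstdev

def center(v):
    m = mean(v)
    return [t - m for t in v]

def standardize(v):
    m, s = mean(v), pstdev(v)
    return [(t - m) / s for t in v]

def ridge_2d(x1, x2, y, lam):
    """Solve (X'X + lam*I) b = X'y for two centered features."""
    s11 = sum(u * u for u in x1) + lam
    s22 = sum(v * v for v in x2) + lam
    s12 = sum(u * v for u, v in zip(x1, x2))
    r1 = sum(u * t for u, t in zip(x1, y))
    r2 = sum(v * t for v, t in zip(x2, y))
    det = s11 * s22 - s12 * s12
    return (s22 * r1 - s12 * r2) / det, (s11 * r2 - s12 * r1) / det

rng = random.Random(0)
n = 200
x1 = [rng.gauss(0.0, 0.01) for _ in range(n)]   # tiny variance, strong predictor
x2 = [rng.gauss(0.0, 10.0) for _ in range(n)]   # large variance, weak predictor
y = [100.0 * u + 0.01 * v + rng.gauss(0.0, 0.1) for u, v in zip(x1, x2)]

lam = 1.0
yc = center(y)
b1_raw, b2_raw = ridge_2d(center(x1), center(x2), yc, lam)
b1_std, b2_std = ridge_2d(standardize(x1), standardize(x2), yc, lam)
b1_back = b1_std / pstdev(x1)   # standardized coefficient in original units

print("true b1 = 100")
print("ridge on raw features         : b1 =", b1_raw)
print("ridge on standardized features: b1 =", b1_back)
```

On the raw features the penalty dominates the small sum of squares of x1 and its coefficient is shrunk far below 100, while after standardizing the same lambda barely changes it.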
|
| Jan Schellenberger | 18 | 03-14-2005 02:34 AM ET (US) |
For Problem 2a)
What does it mean to estimate the median of a distribution using bootstrapping? I can see how you can estimate the median from a sample, but I don't see how bootstrapping helps. Bootstrapping may be useful for figuring out the distribution of the median. Is that the question?
-Jan
|
| Banu Dost | 17 | 03-12-2005 03:48 AM ET (US) |
For problem 2 part b, is using the absolute value of the difference between the two medians as our test statistic a good idea? Or should it be something more complicated?
Banu
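For illustration only, here is how that statistic could be used in a resampling test. The thread does not say which procedure the assignment intends, so this sketch swaps in a plain permutation test as an assumption; the function name and the toy samples are hypothetical.

```python
import random
from statistics import median

def median_diff_permutation_test(a, b, n_perm=5000, seed=0):
    """Permutation test using |median(a) - median(b)| as the statistic.

    Returns (observed_statistic, p_value).
    """
    rng = random.Random(seed)
    observed = abs(median(a) - median(b))
    pooled = list(a) + list(b)
    count = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)                      # random relabeling of the pooled data
        stat = abs(median(pooled[:len(a)]) - median(pooled[len(a):]))
        if stat >= observed:
            count += 1
    # Add-one correction keeps the estimated p-value strictly positive.
    return observed, (count + 1) / (n_perm + 1)

# Toy samples (made up): clearly shifted medians.
shifted_a = list(range(20))
shifted_b = list(range(30, 50))
obs, p_shifted = median_diff_permutation_test(shifted_a, shifted_b)
print("observed |median difference|:", obs, "p =", p_shifted)
```

Note that a permutation test assumes the two samples are exchangeable under the null hypothesis, which is stronger than assuming only that the medians are equal.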
|
| Banu Dost | 16 | 03-12-2005 03:43 AM ET (US) |
In ridge regression, I do not see the point of standardizing the data by shifting and scaling. If we shift the data, the b values do not change, except b0. If we scale a column by 1/std, isn't its b value simply scaled by the column's std? We would still have the same predicted y vector. So what do we gain by standardizing?
|
| Charles Elkan | 15 | 03-10-2005 07:59 PM ET (US) |
/m11 answer: The Laplace distribution has finite variance, and tails that are heavier than the Gaussian's. The Pareto distribution has even heavier tails. See pages 623 and 625 of Casella and Berger.
|
| Charles Elkan | 14 | 03-10-2005 07:53 PM ET (US) |
/m12 answer: What's important is the yhat predictions for test examples. I think these will be the same whether you scale the test data, or you rescale back the training data, because the scaling is a linear transformation. If you rescale back the training data, there is a simple formula for rescaling the coefficients b.
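The rescaling formula follows from substituting z_j = (x_j - mean_j) / std_j into the fitted model: b_j = b_std_j / std_j and b0 = b0_std - SUM_j b_j * mean_j. A small sketch that checks the two prediction routes agree; the training rows and the "fitted" standardized coefficients below are made-up numbers, and the function names are hypothetical.

```python
from statistics import mean, pstdev

def to_original_scale(b0_std, b_std, means, stds):
    """Convert coefficients fitted on standardized features to raw units."""
    b = [bj / s for bj, s in zip(b_std, stds)]
    b0 = b0_std - sum(bj * m for bj, m in zip(b, means))
    return b0, b

def predict(b0, b, row):
    return b0 + sum(bj * xj for bj, xj in zip(b, row))

# Hypothetical training data: two features per row.
X_train = [[1.0, 200.0], [2.0, 180.0], [4.0, 260.0], [7.0, 240.0]]
means = [mean(col) for col in zip(*X_train)]
stds = [pstdev(col) for col in zip(*X_train)]

# Hypothetical coefficients fitted on the standardized training data.
b0_std, b_std = 3.0, [1.5, -0.7]

x_test = [5.0, 210.0]

# Route 1: scale the test row with the TRAINING means and stds.
z_test = [(x - m) / s for x, m, s in zip(x_test, means, stds)]
yhat_scaled = predict(b0_std, b_std, z_test)

# Route 2: rescale the coefficients and use the raw test row.
b0, b = to_original_scale(b0_std, b_std, means, stds)
yhat_raw = predict(b0, b, x_test)

print(yhat_scaled, yhat_raw)
```

Either way, the test data is transformed with the training set's means and standard deviations, not with its own.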
|
| Charles Elkan | 13 | 03-10-2005 07:50 PM ET (US) |
/m10 answer: I don't think you can say that any value for lambda is intrinsically large or small. What's more meaningful is how large SUM_i (y_i - b0 - SUM_j x_ij*b_j)^2 is relative to lambda * SUM_{j>=1} b_j^2. If the latter is smaller than the former, that says the ridge regression is not changing the answer much. This is what I would expect if all the predictors are useful, and the sample size is large.
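A minimal sketch of that comparison: compute the residual sum of squares and the penalty term separately and look at their relative size. The function name and the toy numbers are hypothetical; in practice X, y, b0, b, and lambda would come from your own fit.

```python
def ridge_objective_terms(X, y, b0, b, lam):
    """Return (residual sum of squares, ridge penalty) for a fitted model."""
    rss = sum(
        (yi - b0 - sum(xj * bj for xj, bj in zip(row, b))) ** 2
        for row, yi in zip(X, y)
    )
    penalty = lam * sum(bj * bj for bj in b)   # b0 is not penalized
    return rss, penalty

# Toy numbers, for illustration only.
X = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
y = [1.0, 2.0, 4.0]
rss, penalty = ridge_objective_terms(X, y, b0=0.5, b=[1.0, 1.0], lam=2.0)
print("RSS =", rss, "penalty =", penalty, "ratio =", penalty / rss)
```

A small penalty-to-RSS ratio suggests that the shrinkage is having little effect on the fit, whatever the absolute value of lambda.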
|
| Stephen Krotosky | 12 | 03-09-2005 06:41 PM ET (US) |
Also for Problem 1,
Once we generate the best set b, do we rescale back to the original means and variances, or do we scale our new test data using the same scaling factors that we used for our training data? The test data won't have exactly zero mean and unit variance, but if the training and test data each represent the underlying distributions well, it seems like the error would be small? Also, if we rescale back, how does this affect our b values?
Thanks
|
| Stephen Krotosky | 11 | 03-09-2005 06:39 PM ET (US) |
Question for Problem 2:
I've figured out how to generate the other distributions, but what is an example of a distribution with heavy tails and finite variance?
|
| Stephen Krotosky | 10 | 03-09-2005 06:38 PM ET (US) |
Edited by author 03-09-2005 06:38 PM
Some more questions on problem 1:
After doing forward selection with Sidak, I find 7 significant X parameters. I then scale and shift those values to give zero mean and unit variance. Next, I try to perform ridge regression using 10-fold cross validation to find the best lambda.
My question: I get an MSE on the order of 2700 and optimal lambda values roughly between 1000 and 1600, depending on how I randomly permute my X data into 10 sections. Do these seem like reasonable values? I was thinking that lambda would be considerably smaller.
Also, how do I resolve the problem that I get such a wide range of lambdas depending on how I divide up the data? I suppose I could do repeated attempts and take an average, but that seems computationally expensive, since I would have to range over hundreds of possible lambdas.
Thanks
|