| Who | When |
Messages | |
|
|
|
|
|
| Jonathan Ultis
|
10
|
 |
|
04-13-2001 06:37 PM ET (US)
|
|
The multiplicative error is harder to get a handle on then the additive error. It doesn't shift the location of zero error. In addition, it doesn't add any zero error locations since max(d_r/d_t, d_t/d_r) is always 1 or larger. However, it looks to me like it might introduce new local minima.
I don't really want to take the derivative of it because I'm lazy, but given
let u = sum_r(((h(x_r) - y_r))^2)/r (mean squared error) let v = sum_r(((h(x_r) - O(x_r)))^2)/r let w = sum_t(((h(x_t) - O(x_t)))^2)/t let s = uv
then the error is in the form
error = s/w
the derivative is
(w ds/dx - s dw/dx) / w^2
which is
(w (u dv/dx + v du/dx) - uv dw/dx) / w^2
the polynomial in the numerator could be at most order 5 and the polynomial in the denominator could be at most order 4, since w, u, and v are quadratic and (dv, du, dw)/dx is linear.
If things don't cancel out, it could be that the new error function creates a surface with many more local minima then the original error surface. Could it be that this method works by causing learning routines to get stuck in well placed local minima instead of allowing them to find the global minimum?
Anyone want to solve the derivative and find out?
|
| Jonathan Ultis
|
9
|
 |
|
04-13-2001 03:55 PM ET (US)
|
|
Edited by author 04-13-2001 04:03 PM
Oops - I swapped _r and _t from what they were in the paper. Sorry about that. In the work below, _r is labelled training data and _t is all data.
The error functions behave in an unintuitive way for me.
Say, for example, you have 3 colinear points. In particular, consider these colinear points (assume a little noise to make the problem harder if you like).
(0, 0), (0.1, 0.1), (1, 1)
assume (0, 0) and (1, 1) are labelled, but (0.1, 0.1) is not. Let PHI (which I'll write as O(x)) be the mean of the labelled points - O(x) = 0.5
Now consider the additive error
error = sum_r((h(x_r) - y_r)^2)/r + d_r - d_t which is
sum_r((h(x_r) - y_r)^2)/r + sum_r((h(x_r) - O(x_r))^2)/r - sum_t((h(x_t) - O(x_t))^2)/t
take the derivative and combine the two sums over r. We want to find local minima or maxima, so set it equal to 0. I also factored out a 2. I'm putting in r=2 and t=3 to simplify the math.
0 = sum_r(2h(x_r) - y_r - O(x_r))/2 - sum_t(h(x_t) - O(x_t))/3
doing the sums by hand...
sum_t(h(x_t) - O(x_t)) = h(0) - O(0) + h(0.1) - O(0.1) + h(1) - O(1)
sum_r(2h(x_r) - y_r - O(x_r)) = 2(h(0) + h(1)) - (0 + 1) - (O(0) + O(1))
together (this one is complicated and I could have messed it up)
0 = 4h(0) + 4h(1) - 2h(0.1) - O(0) - O(1) + 2O(0.1) - 3
since O(x) = 0.5
0 = 4h(0) + 4h(1) - 2h(0.1) - 5
let h(x) be the best fit straight line that minimizes the error
h(x) = ax + b
0 = 4(0 a + b) + 4(1 a + b) - 2(0.1 a + b) - 5
0 = 3.8a + 7.8b - 5
1.31 - 2.05 b = a
So, if we apply the correct slope for a (1), we end up with b = 0.15, which puts the line through (0, 0.15), (1, 1.15).
If we apply the correct b (0), we end up with a = 1.31, which puts the line through (0, 0), (1, 1.31)
Neither of these are correct, which suggests that the additive error will not work well in practice.
I had a mistake in my math the first time I worked this that made the line come out correctly for (0, 0), (0.5, 0.5), and (1, 1). I'm not sure if what I have now is entirely correct. If anyone finds a mistake, please let me know.
|
| Joe Drish
|
8
|
 |
|
04-11-2001 03:46 PM ET (US)
|
|
"but I could not find an explanation of why they considered it so".
The authors mention that "although the hard max ratio penalty in (5) appears to be ad hoc, this objective is not entirely unprincipled." They go on to say that "in the special case where the origin function happens to be close to the Bayes-optimal predictor, minimizing (5) will result in near optimal generalization performance in the base hypothesis class".
|
| Melanie Dumas
|
7
|
 |
|
04-11-2001 01:09 PM ET (US)
|
|
While the general idea is sound, I found the paper rather difficult to read, due to undefined terms (the definition of a metric space?) and formatting errors. The figures contained unlabelled data axis, and seemed too cluttered with numeric data.
I'm unsure if the benefits of using this algorithm over a simple parameter optimization script.
|
| Hector Jasso
|
6
|
 |
|
04-11-2001 01:03 PM ET (US)
|
|
Although the general idea of considering unlabeled data to guide the choice of hypothesis is an interesting idea, the method the authors use is not very well founded theoretically. For example, the paper says "we have found that the multiplicative objective generally performs much better [compared to the additive objective]" but I could not find an explanation of why they considered it so. So, I thought that the tests they did to check their algorithm (only four tests for polynomials, for example) was not enough, even considering that they compared it with many other alternate algorithms like REG, ADJ, etc.
Further work on the paper could therefore involve exploring a more solid theoretical foundation for the results found, and also testing with real-life data. In general, the approach is very promising, though.
|
| Aldebaro Klautau
|
5
|
 |
|
04-11-2001 12:29 PM ET (US)
|
|
I was a bit confused with the notation d(.,.) for both d(h,phi) and d(h,P(y|x)) that are different concepts in my opinion. About the phi function: in the first paragraph of section 4 the authors mention "we usually just set phi to the zero function or a constant at the mean of y labels".
|
| Sameer Agarwal
|
4
|
 |
|
04-11-2001 06:50 AM ET (US)
|
|
hi, frankly I do not understand how or why the technique is working. My confusion comes from the arbitrariness of the choice of the basline function phi.
since we are choose whatever we want.. atleast the paper does not suggest anything.. why not just chose phi(x) = 0, and not keep it in the equation at all ? Removes one more parameter from your test setup.. and can make it simpler still. In that case all that we are doing is .. assuming that we have enough unlabelled points available.. in the limit we get..
is
\sum h(x_i)^2 ---------- E (h(x)^2)
in which case it become a comparison of how well the second moment of the h(x) is captured by the empirical second order moment of h(x) . Since deviatons from it are penalized.
so is generalization capbility captured by how well the second moment of hypothesis is captured by the empirical estimate just using the training data.. one nice thing is.. that as the number of training samples go up.. the term ratio goes to 1 as expected. Since in the limit of infinite training data we will be able to construct a perfect fit.
so my questions are :
1. Why do you need phi(x) ? 2. If you indeed need phi(x).. what is the criterion that you can use to select one.. 3. (on a slightly wild note ) is it possible to prove a bound like which connects .. d(phi(x),f(x)) (where f(x) is the actual function underlying the data,) with how good is phi(x) in predicting the goodness of fit. I get the idea from A* where the closer the heuristic h is to the actual distance h* the lesser is the number of states one has to explore...
finally I agree with Greg, I think the technique will not be able to differentiate between failry similar hypothesis.. on the other hand.. perhaps that is what phi(x) is supposed to do.. that is if we use a nontrivial phi(x) we will be able to do the differentiation.
Also the authors do not mention what phi they used for the polynomial regression test.
sameer
|
| Dave Kauchak
|
3
|
 |
|
04-10-2001 08:39 PM ET (US)
|
|
I thought the idea of using unlabeled data to help increase the performance of the training was a good addition.
I did, however, have some reservations with the comparisons that the paper made with the other algorithms (REG, MAP, ADJ, etc.). The tests that the paper provided used both labeled and unlabeled data for the training. But (someone correct me if I'm wrong), it seems that the only algorithm that actually uses the unlabeled data is the ADA algorithm discussed in the paper. This does give the ADA algorithm a slight advantage because it is provided with more information than the others. Also, this is a comparison between algorithms with similarities but seem to fall in slightly different classes. Because of this, I think that we should be careful in interpreting the results of the tests.
|
| Bianca Zadrozny
|
2
|
 |
|
04-10-2001 07:02 PM ET (US)
|
|
d_hat() is an estimated distance, while d() is the true distance. The paper assumes that there is a small number of labeled examples and a large number of unlabeled examples, So, when we use the labeled examples we are calculating an estimated distance (d_hat()). When we use the unlabeled examples we have a true distance (d()) because we are using lots of data points to compute the distance (although, in my opinion, this is still an estimate).
|
| Greg Hamerly
|
1
|
 |
|
04-10-2001 05:18 PM ET (US)
|
|
I enjoyed reading this paper. Something I have been interested in is what to do with all kinds of data that don't have labels, but that could be exploited.
I'm still confused on the notation of d() versus d_hat(). The latter (d_hat), I can see is the training error. The former (d) is posed as an integral -- I can only assume that actually means a sum over the unlabeled data points. Can anyone else put the meaning of d() and its ratio with d_hat() into plain english?
Working on intuition, this technique would seem to work at discriminating between widely different hypotheses (i.e. a 2nd-order polynomial versus a 10th-order polynomial), but I don't see that it has much power to discriminate among fairly similar hypotheses. Thus it seems to me to be a technique used for very gross model selection.
- greg
|
|