Murat Aykut
11092006
07:45 AM ET (US)

I'm studying on Pattern Recognition, especially AdaBoost. For this purpose, I read "Experiments with a New Boosting Algorithm" Y. Freund, R. Schapire paper and understood AdaBoost.M1 algorithm. But I have a few questions about AdaBoost.M2. "Call WeakLearn, providing it with mislabel distribution Dt.". What does this sentence mean? I want to use Euclidean Linear Discriminant classifier with AdaBoost. But, how can I give the distribution to the weak learner? In Adaboost.M1, because of all examples have one value, it didn't cause any problem. But, in Adaboost.M2 an example have K1 value (corresponding with mislabels).
Please help me.
Murat.

Gyozo Gidofavi
10232001
12:40 PM ET (US)

I agree with the comment made by David about the pseudo loss. I was a little confused about its intuitive meaning and purpose.
As a reply or extension to the message posted by HsinHao, i have an intuitive feeling that one can find real life examples where a series of human experts learn in a similar manner to AdaBoost, but i could not find and instance of this myself yet. I would be interested, if any of you would have an example that would be applicable.
I found the remark made about validity of Occam's razor in 4.1 of the first paper very interesting. At first i thought that one of the reasons why AdaBoost's performance is so exceptional is that it finds "independent, new" weak classifiers even when the training error of the previous classifier was 0. This could explain, how it would be still possible for AdaBoost to improve the test error when the training error is already 0. It would also possibly add an advantage to AdaBoost over traditional gradient descant methods which can easily overfit and are not able to improve on the test error once the training error is 0. When looking at the formula for the calculation for the distribution D_t+1 in the algorithm it was not clear however what this distribution would be in case the previous wear classifier had 0 training error.

andrew cosand
10232001
06:42 AM ET (US)

After reading Freund and Schapire, the first question that comes to mind is: if a classifier does worse than random guessing, why not just negate it? Now its better ;) The other thing that came to mind is that if you have some Bernouli trial (a classification) you can take a series of them and you have a binomial distribution. I can't remeber too much of the math right now, but as I recall there's some way to extract a much better classification from the binomial than from the individual Bernouli trials. So myquestion is then how do these boosting algorithms compare to simple combination of single classifications. Does one mathematically reduce to the other? To get a binomial I believe that the Bernouli trials must be independent how independent are the different sets of training examples that Freund and Schapire use? Edited 10232001 06:42 AM

HsinHao Yu
10232001
05:43 AM ET (US)

1. This is probably an idiotic suggestion, but I can't resist proposing the following experiment  since the boosting procedure can be applied to ANY learning device, we can certainly apply it to human learners. Let's train a series of human experts on a classification task (eg. identifying obscure insects, for example) according to the principles of AdaBoost, and see if this is more efficient than training one single expert. Of course "human nature" will make the mathematical arguement in the papers invalid, but what the heck.
2. Maybe it's because I didn't really follow the mathematics of the second paper, but I am quite skeptical about training a community of experts to vote. I read an example of AdaBoost somewher, where the trained individual expert solve the problem incorrectly (eg. they capture the underlying distribution of data very wrong), but the nonlinear voting procedure somehow made the net result quite accurate. Yes, this is the whole point of community machine, but intuitively, if each expert simply doesn't "get it", we cannot expect that the voting procedure will magically make the problem solvable.
A paper by Jeff Elman (the "starting small" paper) showed that in order to solve a problem (especially when you are using recurrent nets), sometimes the network need to be trained on a much easier task. Without it the network is not able to solve the problem. It seems that there are a lot of unexplored possibilities of reassigning the distributions of the training data, possibily involving making the problem easier in some nontrivial way, instead of AdaBoost's "making it harder" strategy.

Joe Drish
10232001
03:21 AM ET (US)

It said that in the paper on page 6 that pseudoloss (AdaBoost.M2) and error (AdaBoost.M1) were identical when applying them on twoclass problems. It also said that on page 3 that the main disadvantage of AdaBoost.M1 was that it is unable to handle weak hypotheses with error greater than 1/2. My question is what improvement can you make in the two class case when the weak hypotheses have error greater than 1/2. You would not be able to use pseudoloss, or AdaBoost.M2, since it would just be identical to error, or AdaBoost.M1. I suppose in practice it is typical for weak hypotheses to be less than 1/2 in the two class case, making it unnecessary to think about that problem anyway. I think that they maybe should omitted the second part of the paper, or the part where they talk about the performance of using a nearestneighbor classifier on an OCR problem. I agree with Junwen that they should have expanded on this idea to include a discussion about how it could be used to improve SVMs and other related algorithms, even though SVMs at the time were not very wellknown and understood. But I do think they should have limited the paper just to their first discussion and analysis. So perhaps this paper should be divided into two separate papers. What interested me about the paper was the differentiation of plausiblity and probability. I would be curious to hear an explanation of this distinction in tomorrow's presentation. Overall, I thought the paper was good and the idea of pseudoloss is neat. Edited 10232001 03:24 AM

Dave Kauchak
10232001
03:20 AM ET (US)

To try and answer one of your questions Junwen: 1. You can think of the distribution of D as a weighting of the importance of the specific training examples. Our weak learner will return some that classifies examples (i.e. a function from the X to Y). That rule will presumably predict some number of training examples correctly and some number incorrectly. The weak learner is trying to decide a rule so as to minimuze some function of correctly classified and incorrectly classified examples (including the distribution D, or weighting). A simple approach to decide the best rule would just be to sum up the weights of the correctly classified training examples and sum the weights of the incorrectly classified examples and take the difference of the two. The rule that maximizes this values is the best rule. You can think of more elaborate algorithms for using these weights, but the key idea is to think of the weights as a rank of importance. So you want to pick a rule that will correctly pick the higher weighted ones right over the lower weighted ones.

Junwen Wu
10232001
02:21 AM ET (US)

Questions: 1.How to apply the distribution D in these algorithms proposed in Freund and Schapire's paper?
2.In Schapire and Singer's paper, in the generalized version of AdaBoost, sometimes h(y)'s range can be all real value, sometimes be restricted in [1,+1], sometimes be restricted in {1,0,+1}, so I'm confused by this. How to decide the range of h(y)? Will it influence the classfication result?
I'm really interested in section 4 of Freund and Schapire's paper. I think maybe it can also be used in SVM classifier(in the editingSVM) to reduce the number of support vectors, so that similarly the SVM algorithm can also be speed up.

Dave Kauchak
10222001
09:37 PM ET (US)

First, I'd like to make a couple of specific comments about the paper(s). I found the slight tweak in the M2 version where the algorithm outputs a simple set very interesting. Intuitively, this seems like a good idea and experimentally, it seems to work better. However, I did find the concept of pseudoloss a bit confusing (hopefully, this confusion will be resolved in the presentation :). I also have a few general comments about boosting in general that the paper somewhat hinted at. One of the amazing things about AdaBoost (and boosting in general) is that it is applicable in such a broad range of applications. And to boosting's credit, the performance is usually impressive. Another interesting thing that is somewhat discussed in the paper is that the contributions of boosting are still a bit of a mystery. I think more and more research is helping to isolate the exact effects of boosting, but I still don't think the whole picture is perfectly understood. Edited 10222001 09:40 PM


