QuickTopic (SM) free message boards QuickTopic (SM) free message boards
Skip to Messages
  Sign In to access your topic list  |New Topic |My Topics|Profile
Upgrade to Pro   Customize, show pictures, add an intro, and more:   QuickTopic Pro...and check out QuickThreadSM
Topic: On Discriminative vs. Generative classifiers
Views: 366, Unique: 230 
Subscribers: 0
What's
this?
Printer-Friendly Page
Subscribe to get & post, or stop messages by email Subscribe
About these ads
Who | When
Messagessort recent-bottom   
Post a new message
 
Vladimir Bulatov  16
09-10-2002 05:38 AM ET (US)
Edited by author 09-10-2002 05:41 AM
I beg my pardon for interfering into the discussion, but my problem is at least semi-relevant to the topic.
My scientific work (in the area of marketing | consumer behavior) is stagnant because I cannot find quick and easy explanation of why/how Bayesian and MCMC methods advance data analysis. I know that the approach has become extremely popular and offer higher precision than traditional stat. analysis.
So the problem: can anyone address me to the sources of information than will give in simple words the explanation of the tough stuff (i.e. how Bayesian theory is build; how it is incoroporated into data analysis, and - how MCMC (and particulary Gibbs approach) helps).

Thank you.

If you can, please, cc the reply to: bbe@hologr.com
Greg Hamerly  15
04-11-2002 10:20 PM ET (US)
Replying to bianca's post in /m11, I missed the fact that n is the dimension. My mistake. However, my point remains that m is simply a training set size that the authors choose, which may be as large as the size of the data set (which I was incorrectly taking to be n), but it also may be smaller.
Joe Drish  14
04-11-2002 06:22 PM ET (US)
Regarding domain specific vocabulary, I don't know if that's as necessary as saying very early that naive Bayes is the generative method and logistic regression the discriminative. I suppose one-line explanations would help, but not a formal explanation for every concept.
Gyozo Gidofalvi  13
04-11-2002 05:33 PM ET (US)
I agree with earlier comments that the experiments performed were not sufficient, were somewhat confusing, furthermore did not entirely support the theoretical results derived in the paper. As Dave noted earlier only about 50% of the experiments actually conformed to the theoretical results.

I was not entirely sure about how general the results were. I think generative and discriminative classifiers chosen in the paper make assumptions about the distribution of samples that are not made by other classifiers in general, hence the generality of the proofs is questionable.
sameer agarwal  12
04-11-2002 04:26 PM ET (US)
Here is some more on the VC dimension.
 
The VC dimension is used as measure of the "capacity" of a classifer learning algorithm.
By capacity we mean, how many different arrangements of a number of points can be classifed correctly by a classifer learned using the the algorithm.

A simple example is an algorithm which learns to draw a hyperplane in n-dimensions. The analytical result is that the VC dimension of the set of hyperplanes in n dimensions is n+1. Consider the case of n=2, the 2 dimensional plane. Take any three points and given then abritray biniary labels. It is easy to see that its possible to draw a line such that points with the same labels lie on the same side of the line.

However if we take four points. Consider an arrange of these points on the vertices of a square, with the points lying on the positive diagonal labeled 1 and the points lying on the negative diagonal labeled 0. In this case we cannot draw any line which will seperate the two classes into distinct regions of the space.

Hence in 2 dimensions the class of straight lines has a VC dimenions of 3. Notice that this does not mean that there do not exist arrangements of points > 3 which cannot be seperated using a straight line. All that it means is that 3 is the maximum number for which "all" possible labelings of the points can be seperated.

Now for any learning algorithm, its VC dimension is defined to be the VC dimension of the set of hypothesis (boundaries) that it can learn.

Finally, the reason why there is so much talk about the VC dimensions of learning algorithms is because of the seminal work of Vapnik and Chervonenkis on bounding the test error of a classifier in terms of its training error, and its VC dimension.

approximately the following inequality holds :

E_test <= E_train + O(sqrt(V))

where V is the VC dimension of the learning algorithm.
So even though you might be able to get zero training error by choosing a large and complicated enough learning algorithm (high VC dimension) the test error need not necessarily be low.
Bianca Zadrozny  11
04-11-2002 04:11 PM ET (US)
I do not understand why Greg said: "Certainly m<n". The results in the paper use the O notation, which means that we may need up to m=a*log(n)+b to reach the asymptotic error for naive Bayes and m=a*n + b for logistic regression. Since we do not know a and b, we cannot assume that m is at most n.
For example, for the adult dataset n=14, but the error is still being reduced after m=14 examples.
Degui Zhi  10
04-11-2002 03:59 PM ET (US)
following the lines of Bret and Dana's comments:
I think the author implicitly assume that readers have background on PAC (Probably Approximately Correct)learning framework.

I think this paper can be a good paper without the experimental section, which is weak and seemingly coming out of a hurry to catch the deadline of a conference (actally I seen typos in the paper, too).
Yohan Kim  9
04-11-2002 02:18 PM ET (US)
iid = independent and identical distribution
An iid example would be the following:
random variables X1,X2,...Xn, each has its own density function f(Xi), where i=1,2...n. Their joint density function is the product of f(Xi). And f(Xi)'s are identical, say gaussian with some mean and variance.

VC = Vapnik-Chervonenkis

"The VC dimension measures the complexity of the hypothesis space H, not by the number of distinct hypotheses |H|, but instead by the number of distinct instances from X than can be completely discriminated using H." Machine Learning, Mitchell 1997.
Dana Dahlstrom  8
04-11-2002 01:54 PM ET (US)
What is an iid example?

``VC'' means venture capital, right? Or is it Viet Cong?
Bret Ehlert  7
04-11-2002 04:17 AM ET (US)
Edited by author 04-11-2002 04:18 AM
I agree with Dave's comment on the lack of explanation of domain specific vocabulary. I was particularly hung up on the paper's use of the term "VC dimension", which the authors do not define (and of which I was not familiar). A one sentence definition of the Vapnik-Chervonenkis dimension would have been quite helpful. I don't think the authors were very concerned with making this paper accessible to readers without a strong background in statistical learning theory.
Greg HamerlyPerson was signed in when posted  6
04-11-2002 03:20 AM ET (US)
I liked this paper's clear introduction to the task at hand.

I believe that the m that people are discussing is simply the sample training size, which can vary according to the experimenter's will. Certainly m < n, but other than that the authors could have limited m so that the graphs showed more detail. Of course, this leaves the reader to wonder what happens if m could get much bigger but they don't show it on the graphs.

They also make a case in lemma 3 and its proof that m = O(log(n)) for a generative model, which is probably another reason to justify limiting m in their experiments.
Dave KauchakPerson was signed in when posted  5
04-10-2002 03:37 PM ET (US)
Although I thought that the main theoretical proof section was informatative, I felt the paper was a bit lacking in the introductory and experimental sections. I would liked to have seen a bit better introduction, particularly with the extensive use of domain specific vocabulary.

I found the results in the experimental section compelling, but not as compelling as the authors. I understand that the main contribution of this paper is of a more theoretical one, but if experiments are to be included, they should be analyzed further. It appears to me that only half of the experiments seem to show the conclusions of the authors. To simply brush this off by saying "m presumably cannot grow large enough..." is not sufficient. Also, a detailed analysis of the figures and experiences running the tests would have made the experimental section much more useful to the readers.
Aldebaro  4
04-10-2002 03:16 PM ET (US)
Yes, Fig. 1 is not completely clear. I think m should be the raw number of training examples, to be consistent with the notation. I checked http://www1.ics.uci.edu/~mlearn/MLSummary.html, where one can find that for "lymphography" there are 148 examples, which is consistent with Fig. 1, but "adult" has 48,842 examples (some with missing attributes).
Joe Drish  3
04-10-2002 06:12 AM ET (US)
Edited by author 04-10-2002 06:13 AM
It's amazing how well the theoretical results compare to the experimental results. It would have been convenient to provide the sample sizes of the datasets (correct me if I'm wrong, but I don't think the numbers corresponding to m on the horizontal axes are raw numbers), but I suppose those can be determined by looking them up at the UCI repository. Second, I'm curious to know why logistic regression has not been praised or used much in commercial data mining applications, given the conventional folk wisdom "articulated" by Vapnick as discussed in the paper's introduction, and also given the results presented here. Naive Bayes, not logistic regression, is highly touted by many data mining experts.

Elkan claims in "Boosting and Naive Bayesian Learning", 1997, that naive Bayesian classification is just a nonparametric, nonlinear extension of logistic regression. Is the analysis conducted by Ng and Jordan even necessary, given that naive Bayes is just an extension of logistic regression? The main difference between the two methods is that naive Bayes requires an intermediate step, but is this the only high level conceptual difference?

I take issue with the way the results are expressed - that the generative model approaches its asymptotic error [faster] than the discriminative model. It does so, but naive Bayes does start out initially with a much lower error. Empirically, the slopes of the curves are very much alike as the the training size (m) is increased. They do say that this is what is happening, but not clearly.
Aldebaro  2
04-08-2002 09:14 PM ET (US)
I sent to Thomas a ps file obtained from:
http://www.cs.berkeley.edu/~ang/
Thomas  1
04-03-2002 09:36 PM ET (US)
I downloaded the ps file and tried to unzip it with WinZip, but it complained it was corrupt. Could any one forward me a copy of this paper? My email is tzheng@qualcomm.com

Thanks
RSS link What's this?
QuickTopicSM message boards
Over 200,000 topics served
Learn more Frequently asked questions  Acknowledgements
What they're saying about QuickTopic
 Questions, comments, or suggestions? Contact Us
Read our use policy before beginning. We value your privacy; please read our privacy statement.
Copyright ©1999-2006 Internicity Inc. All rights reserved.