QuickTopic (SM) free message boards
Topic: A Comparison of Event Models for Naive Bayes Text Classification
Dustin Boswell  5
05-13-2003 02:23 AM ET (US)
At the end of class, we brought up the possibility of using cross-entropy to do text classification. If anyone is curious, here's an interesting paper that builds n-grams of the LETTERS (instead of the words):
http://citeseer.nj.nec.com/teahan00text.html
Among other tasks, they report results comparable to the ones we saw in the presentation for the Newsgroups data. This method also has the benefit of not needing to know the optimal vocabulary size in advance.
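To make the cross-entropy idea concrete, here is a minimal sketch. It assumes a fixed-order character n-gram model with add-one smoothing, which is a simplification and not the compression-based model from the linked paper; the class name `CharNgramModel` and the helper `classify` are made up for illustration. One model is trained per class, and a document is assigned to the class whose model gives it the lowest cross-entropy (fewest bits per character).

```python
import math
from collections import Counter

def char_ngrams(text, n):
    """Yield (context, next_char) pairs for an order-n character model."""
    padded = " " * (n - 1) + text  # pad so the first characters have a context
    for i in range(n - 1, len(padded)):
        yield padded[i - n + 1:i], padded[i]

class CharNgramModel:
    """Hypothetical per-class model: add-one smoothed character n-grams."""

    def __init__(self, n=3):
        self.n = n
        self.ngram_counts = Counter()
        self.context_counts = Counter()
        self.alphabet = set()

    def train(self, text):
        for ctx, ch in char_ngrams(text, self.n):
            self.ngram_counts[(ctx, ch)] += 1
            self.context_counts[ctx] += 1
            self.alphabet.add(ch)

    def cross_entropy(self, text):
        """Average bits per character of `text` under this model."""
        v = len(self.alphabet) + 1  # +1 reserves mass for unseen characters
        total_bits = 0.0
        count = 0
        for ctx, ch in char_ngrams(text, self.n):
            p = (self.ngram_counts[(ctx, ch)] + 1) / (self.context_counts[ctx] + v)
            total_bits -= math.log2(p)
            count += 1
        return total_bits / count if count else float("inf")

def classify(text, models):
    """Pick the label whose model yields the lowest cross-entropy."""
    return min(models, key=lambda label: models[label].cross_entropy(text))
```

Note that no word vocabulary is ever chosen here; the only tunable is the n-gram order `n`.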
Dustin Boswell  4
05-13-2003 01:06 AM ET (US)
Is skipping stemming really what violates the naive Bayes assumption? To me, it seems like independence of words is a poor approximation to begin with, and the stemming issue wouldn't make a huge difference either way. But from a practical point of view, it's nicer to have a system that just takes training data "as is" (without extra preprocessing).

Does anyone else think the results aren't all that conclusive? They don't have that much training/testing data, for starters. And the fact that they show test results over a range of vocab sizes makes me wonder, "How would I know what vocab size to use in practice?" For some of the test results (Newsgroups and WebKB) there doesn't seem to be much of a difference between multinomial and Bernoulli.
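For anyone who wants to see the two event models side by side, here is a minimal sketch. It assumes Laplace (add-one) smoothing and equal class priors, and the function names are made up for illustration rather than taken from the paper. The multinomial model scores every word occurrence in the document; the Bernoulli model scores every vocabulary word once, as present or absent.

```python
import math
from collections import Counter

def train(docs, vocab):
    """Estimate both parameter sets for ONE class from its tokenized docs."""
    word_counts = Counter(w for d in docs for w in d if w in vocab)
    doc_freq = Counter(w for d in docs for w in set(d) if w in vocab)
    total = sum(word_counts.values())
    # Multinomial: P(word | class), estimated from occurrence counts.
    multinomial = {w: (word_counts[w] + 1) / (total + len(vocab)) for w in vocab}
    # Bernoulli: P(word appears in a doc | class), from document counts.
    bernoulli = {w: (doc_freq[w] + 1) / (len(docs) + 2) for w in vocab}
    return multinomial, bernoulli

def multinomial_loglik(doc, theta):
    """Sum over word occurrences; repeated words count repeatedly.
    The class-independent multinomial coefficient is omitted."""
    return sum(math.log(theta[w]) for w in doc if w in theta)

def bernoulli_loglik(doc, theta):
    """Sum over the whole vocabulary; absent words contribute log(1 - p)."""
    present = set(doc)
    return sum(math.log(p if w in present else 1.0 - p) for w, p in theta.items())
```

With equal priors, classification is just picking the class with the larger log-likelihood under the chosen model. The vocabulary is an explicit input to both models, which is where the "what vocab size?" question comes from.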
Coleman Mosley  3
05-12-2003 07:32 PM ET (US)
Edited by author 05-12-2003 07:33 PM
I wonder what N_is (N with subscript "is") represents versus the N_it Andrew mentioned in his question. Is it a misprint, or is it a count across all the possible occurrences?
Are the factorials part of their probability distributions? I don't have the paper directly in front of me.
Andrew Smith  2
05-12-2003 05:52 PM ET (US)
Neil, they use a many-to-one correspondence of mixture components to class labels in the next paper, "Text Classification from Labeled and Unlabeled Documents using EM." Basically, the mixture components correspond to sub-topics whereas the class labels correspond to more general topics (i.e., baseball vs. sports). Look on page 14 for their example.
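A rough sketch of what many-to-one means computationally, assuming the class posterior is obtained by summing the posteriors of the components mapped to that class. The component names, the mapping, and the numbers below are all hypothetical.

```python
from collections import defaultdict

def class_posteriors(component_posteriors, component_to_class):
    """Sum P(component | doc) over the components assigned to each class."""
    totals = defaultdict(float)
    for component, p in component_posteriors.items():
        totals[component_to_class[component]] += p
    return dict(totals)

# Hypothetical mapping: two sub-topic components share one class label.
component_to_class = {
    "baseball": "sports",
    "hockey": "sports",
    "hardware": "computers",
}

# Hypothetical posteriors P(component | doc) for a single document.
component_posteriors = {"baseball": 0.5, "hockey": 0.25, "hardware": 0.25}

result = class_posteriors(component_posteriors, component_to_class)
```

So a document still gets a single class label; the mixture just lets one class be modeled by several word distributions instead of one.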

Does anyone know the intuition behind the |d_i|! and N_it! in equation (5)? I usually see the probability of a document given a class as equation (5) without those two quantities.
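One way to read the two factorial terms, assuming equation (5) has the standard multinomial form (the notation below is reconstructed from that assumption, not copied from the paper): together they form the multinomial coefficient, which counts how many distinct word orderings produce the same count vector.

```latex
P(d_i \mid c_j; \theta)
  = P(|d_i|)\,|d_i|! \prod_{t=1}^{|V|} \frac{P(w_t \mid c_j; \theta)^{N_{it}}}{N_{it}!}
  = P(|d_i|)\,
    \underbrace{\frac{|d_i|!}{\prod_{t=1}^{|V|} N_{it}!}}_{\text{number of orderings}}
    \prod_{t=1}^{|V|} P(w_t \mid c_j; \theta)^{N_{it}}
```

The coefficient does not depend on the class c_j, so it cancels when comparing classes for the same document. That would explain why the formula is often written without it.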
Neil Jones  1
05-12-2003 05:17 PM ET (US)
I'm curious why they don't use stemming in their text processing. By not stemming, they're violating the naive Bayes assumption (independence of words). It seems trivial to apply a stemming algorithm, so there must be some (unstated) reason for avoiding it.

Also confusing to me is the sort of vague statement that one *could* make it so that there is not a one-to-one correspondence between class labels and the mixture model components. If the two don't correspond, what is the mixture? Could parts of one document belong to two different classes? I can't seem to get my head around what this represents.