Topic: Document clustering using word clusters via the information bottleneck
Greg Hamerly  5
05-14-2002 05:12 PM ET (US)
I, like Dave, was wondering why the authors didn't remove stop words or use stemming. This seems to bias the comparison of their word-clustering method against a "raw-word"-based approach like TF-IDF. However, they DID remove all non-alphabetic and numeric tokens, without giving a reason. It seems simple and intuitive to replace non-alphabetic words with a NONALPHA token, and similarly for numeric data. Also, they keep only the 2000 most informative words as a pre-processing step, which leaves me wondering why they chose this step and not others.
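
To make the placeholder-token idea concrete, here is a minimal sketch. It is not the authors' pre-processing; the tokenizer regex, the function name `normalize_tokens`, and the `NUMERIC` placeholder name are all assumptions made for illustration (only `NONALPHA` comes from the suggestion above), and it treats only ASCII letters as alphabetic.

```python
import re

# Hypothetical tokenizer: runs of ASCII letters, numbers (optionally with
# . or , separators), or runs of any other non-space characters.
TOKEN = re.compile(r"[A-Za-z]+|\d+(?:[.,]\d+)*|[^\sA-Za-z\d]+")

def normalize_tokens(text):
    """Keep alphabetic words; map everything else to a placeholder token."""
    tokens = []
    for tok in TOKEN.findall(text):
        if tok[0].isalpha():
            tokens.append(tok.lower())   # ordinary word, lowercased
        elif tok[0].isdigit():
            tokens.append("NUMERIC")     # assumed placeholder for numbers
        else:
            tokens.append("NONALPHA")    # placeholder for symbols/punctuation
    return tokens
```

Instead of discarding these tokens, the placeholders let the clustering decide whether "this document contains many numbers" carries any signal.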

It seems non-intuitive that they use the 2000 words most useful for discriminating between all pairs of documents in a data set, when the goal is to group documents together. However, in a setting without a priori labels, I guess this is a reasonable way to reduce the number of words to deal with.
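
As a rough illustration of this kind of unsupervised word selection, the sketch below ranks each word by its contribution to the mutual information between words and documents, and keeps the top `k`. This is one plausible reading of "most informative", not necessarily the paper's exact criterion; the function name `top_informative_words` and the plain token-list input are assumptions.

```python
import math
from collections import Counter

def top_informative_words(docs, k=2000):
    """docs: list of token lists. Returns up to k words, ranked by
    sum over d of p(w,d) * log(p(w,d) / (p(w) * p(d)))."""
    n_total = sum(len(d) for d in docs)            # total token count
    word_counts = Counter(tok for d in docs for tok in d)
    scores = Counter()
    for d in docs:
        if not d:
            continue                               # empty documents add nothing
        p_d = len(d) / n_total                     # p(d): share of all tokens
        for w, c in Counter(d).items():
            p_wd = c / n_total                     # joint p(w, d)
            p_w = word_counts[w] / n_total         # marginal p(w)
            scores[w] += p_wd * math.log(p_wd / (p_w * p_d))
    return [w for w, _ in scores.most_common(k)]
```

A word spread evenly across all documents (a typical stop word) scores near zero here, while a word concentrated in a few documents scores high, so no labels are needed to do the filtering.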

I would have liked to see a comparison with other standard methods. I'm not an expert in document clustering, so I don't know if the other methods (besides TF-IDF) are cutting-edge. I would also have liked to see (like Kristen) more explanation of the motivation for double-clustering. However, I think double-clustering is a fairly intuitive and interesting idea for dealing with high-dimensional, noisy data.