Greg Hamerly | 5 | 05-14-2002 05:12 PM ET (US)
Like Dave, I was wondering why the authors didn't remove stop words or use stemming. This seems to bias the comparison between their word-clustering method and a "raw-word" approach like TF-IDF. However, they DID remove all non-alphabetic and numeric tokens, without giving a reason. It seems simple and intuitive to replace non-alphabetic words with a NONALPHA token instead, and similarly for numeric data. Their pre-processing step of keeping only the 2000 most informative words also leaves me wondering why they chose this step and not others.
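To make the token-replacement suggestion concrete, here is a minimal Python sketch. It is only an illustration of the idea, not anything taken from the paper: the function name `normalize_tokens`, the `NUMERIC` placeholder, the whitespace tokenization, and the punctuation stripping are all assumptions of mine.

```python
import re
import string

NONALPHA = "NONALPHA"  # placeholder for mixed tokens such as "x86" (name from the suggestion above)
NUMERIC = "NUMERIC"    # hypothetical placeholder for numeric tokens

def normalize_tokens(text):
    """Map numeric and other non-alphabetic tokens to placeholders instead of dropping them."""
    out = []
    for raw in text.lower().split():
        tok = raw.strip(string.punctuation)  # drop surrounding punctuation like "," or "$"
        if not tok:
            continue  # token was punctuation only
        if tok.isalpha():
            out.append(tok)
        elif re.fullmatch(r"\d+(\.\d+)?", tok):
            out.append(NUMERIC)
        else:
            out.append(NONALPHA)
    return out

print(normalize_tokens("Sold 42 units of x86 chips, at $3.50 each."))
```

With this scheme the word counts still record that a number or an odd token occurred, which is information that is lost when such tokens are simply removed.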
It seems counterintuitive to keep the 2000 words that are most useful for discriminating between all pairs of documents in a data set when the goal is to group documents together. However, in a setting without a priori labels, I guess this is a reasonable way to reduce the number of words to deal with.
I would have liked to see a comparison with other standard methods. I'm not an expert in document clustering, so I don't know whether the other methods (besides TF-IDF) are cutting-edge. Like Kristen, I would also have liked to see more explanation of the motivation for double-clustering. Still, I think double-clustering is a fairly intuitive and interesting idea for dealing with high-dimensional, noisy data.