Kristin Branson | 2 | 05-14-2002 02:29 PM ET (US)
I agree that this paper presented an interesting idea, and the information bottleneck method seems very reasonable. My biggest qualm with the paper is the lack of argument for double clustering. The only reason I can see for it is that the high dimensionality of the data makes the data noisy. I don't understand why, when doing the word clustering, we would want to maximize the mutual information about the documents. It seems to me that an obvious choice would be to maximize the information about the words themselves, since a lot of information about the words is lost in this step, whereas the information about the documents is not thrown away. Perhaps something like PCA on the words could replace the word clustering?
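To make the quantity in question concrete, here is a minimal sketch of what "mutual information about the documents" measures when words are merged into clusters. This is only a toy illustration, not the paper's algorithm: the count matrix, the cluster assignments, and the helper names `mutual_information` and `merge_words` are all made up for the example.

```python
import numpy as np

def mutual_information(joint):
    """I(X;Y) in bits for a joint distribution (or count table) given as a 2-D array."""
    joint = joint / joint.sum()                # normalize counts to p(x, y)
    px = joint.sum(axis=1, keepdims=True)      # marginal p(x), column vector
    py = joint.sum(axis=0, keepdims=True)      # marginal p(y), row vector
    nz = joint > 0                             # skip zero cells (0 * log 0 = 0)
    return float((joint[nz] * np.log2(joint[nz] / (px @ py)[nz])).sum())

def merge_words(joint, clusters):
    """Sum the rows of the word-document table that share a word cluster."""
    k = max(clusters) + 1
    merged = np.zeros((k, joint.shape[1]))
    for w, c in enumerate(clusters):
        merged[c] += joint[w]
    return merged

# Hypothetical counts: 4 words (rows) x 2 documents (columns).
# Words 0 and 1 occur only in document 0; words 2 and 3 only in document 1.
counts = np.array([[4.0, 0.0],
                   [4.0, 0.0],
                   [0.0, 4.0],
                   [0.0, 4.0]])

i_words = mutual_information(counts)                           # I(W; D)
i_good = mutual_information(merge_words(counts, [0, 0, 1, 1])) # merge similar words
i_bad = mutual_information(merge_words(counts, [0, 1, 0, 1]))  # merge dissimilar words

print(i_words, i_good, i_bad)
```

In this toy case, merging words that have the same distribution over documents keeps all of the information about the documents, while merging words with opposite distributions destroys it. That at least shows what the objective rewards, even if it does not settle whether it is the right objective.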
Because the paper puts little emphasis on why the double IB clustering algorithm should be better than other algorithms, it is possible that the other algorithms tested in the experimental section would outperform double IB clustering on some testbeds. The authors did do a thorough job of comparing many algorithms and presenting their results. Their choice of testbed seemed a little strange to me, since these unsupervised methods might not be aimed at text classification (see the footnote on how much better NB did at the text classification task examined). However, the authors seem to have thought about this problem a lot and made a valiant effort to explain their choices.
In terms of writing style, this paper could have benefited from more structure. It seems to dive right into the details of the algorithm before giving a general overview; the introduction, for example, goes into extreme detail about the algorithm and experiments. There seem to be only two levels in this paper: the most general level, which doesn't give much information, and the detailed level, which was a little frustrating to me because I couldn't tell, at the start, where the description was going. However, the paper managed to get its ideas across in the end.