| Who | When |
Messages | |
Degui Zhi | 1 | 05-14-2002 01:53 PM ET (US)
Edited by author 05-14-2002 01:54 PM
Am I the first one to post? Wake up, guys! I think the double clustering idea in this paper is neat; it captures the duality of feature and class. I want to try to relate this paper to the papers I presented in class. Is the Information Bottleneck a similar construct to the Markov Blanket, in the sense that they both reduce the dimensionality of a problem? MB models dependencies between features and eliminates "redundant" ones, while IB simply uses the classification clusters, the "bottleneck", as a compact representation of the features. In other words, IB works on the power set space of features.
|
Kristin Branson | 2 | 05-14-2002 02:29 PM ET (US)
I agree that this paper presented an interesting idea. The information bottleneck method seems very reasonable. My biggest qualm with the paper is the lack of argument for double clustering. I guess the reason I see for it is that the high dimensionality of the data makes the data noisy. I don't understand why, when doing the word clustering, we would want to maximize the mutual information about the documents. It seems to me that an obvious choice would be to maximize the information about the words themselves, since after this step a lot of information about the words is lost, but we do not throw away the information about the documents. Perhaps something like PCA on the words could replace the word clustering?
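To make the quantity in question concrete, here is a minimal Python sketch (NumPy assumed) of the mutual information between words and documents, before and after merging words into clusters. The count matrix, the cluster labels, and the helper names `mutual_information` and `merge_rows` are made up for illustration; none of them come from the paper.

```python
import numpy as np

def mutual_information(joint):
    """I(A;B) in bits for a joint distribution (or count table) given as a 2-D array."""
    joint = np.asarray(joint, dtype=float)
    joint = joint / joint.sum()                # normalize counts to p(a, b)
    pa = joint.sum(axis=1, keepdims=True)      # marginal p(a), column vector
    pb = joint.sum(axis=0, keepdims=True)      # marginal p(b), row vector
    mask = joint > 0                           # skip zero cells (0 * log 0 = 0)
    return float((joint[mask] * np.log2(joint[mask] / (pa @ pb)[mask])).sum())

def merge_rows(joint, labels):
    """Collapse word rows into cluster rows: p(c, d) = sum of p(w, d) over words w in c."""
    joint = np.asarray(joint, dtype=float)
    labels = np.asarray(labels)
    merged = np.zeros((labels.max() + 1, joint.shape[1]))
    np.add.at(merged, labels, joint)           # accumulate each word row into its cluster
    return merged

# Hypothetical word-by-document counts (4 words, 3 documents).
counts = np.array([[4, 0, 1],
                   [3, 1, 0],
                   [0, 5, 1],
                   [1, 4, 0]])

good = merge_rows(counts, [0, 0, 1, 1])        # groups words with similar document profiles
bad = merge_rows(counts, [0, 1, 0, 1])         # groups words with dissimilar profiles

i_words = mutual_information(counts)
i_good = mutual_information(good)
i_bad = mutual_information(bad)
print(i_words, i_good, i_bad)
```

Merging rows can never increase the mutual information with the documents, so a word-clustering step of this kind amounts to picking the merge that loses the least of it. In this toy table the first grouping keeps more than the second.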
As there was not really much emphasis on actual reasons the double IB clustering algorithm is better than other algorithms, there is the possibility that the other algorithms tested in the experimental section might perform better than double IB clustering on some testbeds. The authors did do a thorough job of comparing many algorithms and presenting their results. Their choice of testbed seemed a little strange to me, since it seems that these unsupervised methods might not be aimed at text classification (look at the footnote on how much better NB did at the text classification task examined). However, the authors seem to have thought about this problem a lot and made a valiant effort to explain their choices.
In terms of writing style, this paper could have benefited from more structure. The paper seems to dive right into the details of the algorithm before giving a more general overview. For example, the introduction goes into extreme detail about the algorithm and experiments without giving more of an overview. There seem to be only two levels in this paper -- the most general level that doesn't give much information, and the detailed level, which was a little frustrating to me because I couldn't tell, at the start, where the description was going. However, the paper managed to get its ideas through in the end.
|
Dana Dahlstrom | 3 | 05-14-2002 04:15 PM ET (US)
Edited by author 05-14-2002 04:20 PM
As usual I think I lack the background to evaluate the relative
merits of the method presented here, but I concur with Kristin as
regards the writing style. Several technical issues made this
paper more difficult than necessary to read: (1) the inordinately
long, wandering paragraphs; (2) the incessant and inconsistent
use of italics, single quotes, and double quotes; and (3) the
grammar---there's even a subject-verb disagreement in the
abstract.
I also found it annoying that the plots (on page 7) have different
y-axis ranges, and the x axes are labeled with what appear to be
titles instead of the meanings of the independent variables.
Also, (*smirk*) is the ``Kulback-Libeler'' divergence some
knock-off of the Kullback-Leibler divergence? This reminds me of
the ``Soney'' portable CD player my friend picked up in Taiwan.
|
Dave Kauchak | 4 | 05-14-2002 04:47 PM ET (US)
As Degui comments, I think the initial clustering into word clusters is a way to reduce the dimensionality of the data. I agree with Kristin that their motivation for using clusters is not made very obvious, but I do think that it is to reduce the noisiness that might occur when using raw words. I think this sort of grouping is similar in effect to what is done with supervised learning methods that try to generalize from possibly noisy samples.
I thought the experiments showing that the double methods performed better than the single methods are a good start toward experimentally motivating the use of word clusters over raw words. However, I was a bit disappointed with the experimental setup.
I'm glad the authors tried to get at a concrete data set, where performance could be measured easily. However, I think that the authors could have done a better job justifying their choices after they spent so much time explaining why previous setups were so bad.
The authors briefly mention stop-lists and word stemming. I think that they dismiss this idea too quickly. Why weren't these preprocessing methods used? Word stemming in particular seems like another method of reducing the word noise in the documents. The other methods may have performed better if word stemming or other equivalent processes were used.
|
Greg Hamerly | 5 | 05-14-2002 05:12 PM ET (US)
I, like Dave, was wondering why the authors didn't remove stop words and use stemming. This seems to bias their method of word clustering against a "raw-word"-based approach like TF-IDF. However, they DID remove all non-alpha and numerics, without giving a reason. It seems simple and intuitive to replace non-alpha words with a NONALPHA token, and similarly for numeric data. Also, their pre-processing step of using only the top 2000 most informative words leaves me wondering why they used this step and not others.
It seems non-intuitive that they use the top 2000 words that are useful for discriminating between all pairs of documents in a data set, when the goal is to group documents together. However, in a setting without a priori labels, I guess this would be a good thing to do to reduce the number of words to deal with.
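As a rough illustration of what such a filter might look like, here is a small Python sketch (NumPy assumed) that ranks words by their individual contribution to the word-document mutual information and keeps the top k. That scoring rule is my reading of "most informative" and is an assumption, as are the toy vocabulary, the counts, and the function names.

```python
import numpy as np

def word_mi_contributions(counts):
    """Per-word terms of I(W;D): sum over d of p(w,d) * log2(p(w,d) / (p(w) p(d)))."""
    joint = np.asarray(counts, dtype=float)
    joint = joint / joint.sum()                # normalize counts to p(w, d)
    pw = joint.sum(axis=1, keepdims=True)      # marginal p(w)
    pd = joint.sum(axis=0, keepdims=True)      # marginal p(d)
    ratio = np.ones_like(joint)                # log2(1) = 0 for empty cells
    mask = joint > 0
    ratio[mask] = joint[mask] / (pw @ pd)[mask]
    return (joint * np.log2(ratio)).sum(axis=1)

def top_k_words(counts, k):
    """Indices of the k words with the largest contribution (assumed selection rule)."""
    scores = word_mi_contributions(counts)
    return np.argsort(scores)[::-1][:k]

# Hypothetical vocabulary and word-by-document counts (4 words, 3 documents).
vocab = ["the", "nasa", "hockey", "of"]
counts = np.array([[10, 10, 10],   # spread evenly, like a stop word
                   [6, 0, 0],      # concentrated in one document
                   [0, 0, 6],      # concentrated in another
                   [8, 8, 8]])     # spread evenly

scores = word_mi_contributions(counts)
kept = top_k_words(counts, 2)
print([vocab[i] for i in kept])
```

With a score like this, words spread evenly over the documents contribute almost nothing and drop out without an explicit stop-list, which may be part of why the filter was used in place of one. No labels are needed, which fits the unsupervised setting.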
I would have liked to see a comparison with other methods which are standard. I'm not an expert in document clustering, so I don't know if the other methods (besides TF-IDF) are cutting-edge. I would also have liked to see (like Kristin) more of an explanation of the motivation for double-clustering. However, I think double-clustering is a fairly intuitive and interesting idea for dealing with high-dimensional, noisy data.
|
sameer agarwal | 6 | 05-14-2002 06:20 PM ET (US)
I just got done skimming "The information bottleneck method", which is the original paper in which they actually propose the idea of the information bottleneck. I think the formulation given there is much clearer, more informative, and cleaner.
The principal idea is that you do not just want compression of the data; you want compression that preserves fidelity. Measuring fidelity requires that you define a distortion function, which is a bit of a pain, since choosing a distortion function is equivalent to choosing a priori which features of X are more interesting than others.
In the original paper, a very clean variational formulation of this problem is given, with mutual information serving as the measure of distortion.
Excellent idea and method. I just wish that we had done the original paper instead of the one we were assigned to read.
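For reference, the variational formulation mentioned above can be written roughly as follows. The notation is my recollection of the original paper's and may differ in detail: $X$ is the data, $Y$ the relevant variable, $\tilde{X}$ the compressed representation, and $\beta$ a Lagrange multiplier trading compression against preserved information.

```latex
% Information bottleneck functional, minimized over the soft assignment p(\tilde{x} | x):
\min_{p(\tilde{x}\mid x)} \;
  \mathcal{L}\bigl[p(\tilde{x}\mid x)\bigr]
  = I(\tilde{X};X) \;-\; \beta\, I(\tilde{X};Y)

% Self-consistent form of the minimizer, with Z(x,\beta) a normalization term:
p(\tilde{x}\mid x)
  = \frac{p(\tilde{x})}{Z(x,\beta)}
    \exp\!\Bigl(-\beta\, D_{\mathrm{KL}}\bigl[\,p(y\mid x)\,\big\|\,p(y\mid \tilde{x})\,\bigr]\Bigr)
```

No distortion function is chosen by hand here. The KL divergence between $p(y\mid x)$ and $p(y\mid \tilde{x})$ plays that role, and it falls out of the optimization once the relevant variable $Y$ is fixed.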
|
Messages 7-8 deleted by topic administrator 04-03-2005 07:52 PM