QuickTopic (SM) free message boards
Topic: Feature selection for high-dimensional genomic microarray data
Aldebaro  2
04-30-2002 03:06 AM ET (US)
Edited by author 04-30-2002 05:14 AM
The paper is very interesting but I agree with Dave that the writing could be better.

Section 2.2 presents a rather confusing explanation of information gain. Let me try to make things worse...

Information gain is the same as mutual information. Given a set of examples (f,y), if one picks an example at random and asks for the value of its label y, the uncertainty about the random variable (r.v.) Y is measured by the entropy H(Y). Revealing the value of feature fi of instance f may or may not bring useful information about Y. Once the value of the r.v. Fi is known, the remaining entropy of Y is the conditional entropy H(Y|Fi). The information gain of Fi is
I(Y;Fi) = H(Y) - H(Y|Fi).

The bigger I(Y;Fi), the better, as in Fig. 2(b). Since H(Y) is fixed here, the goal is to find the Fi that minimizes the remaining uncertainty H(Y|Fi). The authors deal with only 2 classes: Y=true or Y=false. So H(Y) is at most 1 bit. If the i-th feature can by itself determine the corresponding label y, then H(Y|Fi)=0 (no uncertainty left) and I(Y;Fi)=H(Y), which is 1 bit when the two classes are equally likely.
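
To make the definition concrete, here is a minimal sketch for a discrete feature. The function names and the toy data are my own, not from the paper, and it assumes the feature has already been reduced to a few discrete values.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy H(Y), in bits, of a list of discrete labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(feature_values, labels):
    """I(Y;F) = H(Y) - H(Y|F) for a discrete feature F."""
    n = len(labels)
    # Group the labels by the value the feature takes.
    groups = {}
    for f, y in zip(feature_values, labels):
        groups.setdefault(f, []).append(y)
    # H(Y|F) is the entropy within each group, weighted by group size.
    h_cond = sum(len(g) / n * entropy(g) for g in groups.values())
    return entropy(labels) - h_cond

# Toy data (made up): two balanced classes, so H(Y) = 1 bit.
labels = [True, True, False, False]
perfect = ["hi", "hi", "lo", "lo"]   # determines the label by itself
useless = ["hi", "lo", "hi", "lo"]   # tells us nothing about the label

gain_perfect = information_gain(perfect, labels)   # equals H(Y)
gain_useless = information_gain(useless, labels)   # equals 0
```

With unbalanced classes a perfect feature still gives I(Y;F) = H(Y), but that value is then below 1 bit.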

The authors present the information gain in a general setup, as when it is used in a decision tree: at a given node there are examples from C different classes (corresponding to a partition of the input space into C sets), and the value of feature fi splits these examples into K sets. In the paper there are only C=2 classes. The "reference partition" is the division of the training data into 2 sets according to the 2 possible leukemia labels (which corresponds to always splitting the root node of a tree).

Also, a minor detail: it doesn't sound right to say (last paragraph of 2.2) that quantization is required to calculate I(Y;Fi). Entropy is also defined for continuous random variables (differential entropy; see Cover & Thomas, 91).
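
For reference, this is roughly how the quantity looks without quantization, for a continuous feature F and a discrete label Y. The notation is mine, not the paper's: h is differential entropy, p(f|y) is the class-conditional density, and P(y) is the class prior.

```latex
\begin{align*}
h(F) &= -\int p(f)\,\log_2 p(f)\,df,
\qquad p(f) = \sum_{y} P(y)\,p(f \mid y) \\
I(Y;F) &= h(F) - h(F \mid Y)
        = \sum_{y} P(y) \int p(f \mid y)\,\log_2 \frac{p(f \mid y)}{p(f)}\,df
\end{align*}
```

In practice the densities still have to be estimated from the data (for instance with Gaussians), so quantization is one convenient estimator rather than a requirement.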

Too much about section 2.2... I am curious why so many genes have "error" equal to 1 in Fig. 2(a). Maybe it's because I didn't figure out how they actually calculated Eq. (1) (integration of Gaussian pdfs?).

I think Fig. 2 simply shows the behavior of the ranking methodology, with no classifiers involved. There is a standard partition into training and test sets, and I guess that's what the authors used for Fig. 3. It's not fair to choose the number of features based on test-set plots like those in Fig. 3, so the authors used cross-validation (leave-one-out) in Figs. 4 and 5.
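
Here is a rough sketch of picking the number of features by leave-one-out instead of looking at the test set. Everything in it is an assumption for illustration: the mean-difference ranking score, the nearest-centroid classifier, the function names and the toy data are mine and are not what the authors used. The one point it is meant to show is that the ranking is redone inside each fold, using only the training examples of that fold.

```python
def rank_features(X, y):
    """Rank feature indices by |difference of class means| (illustrative score only)."""
    n_feat = len(X[0])
    scores = []
    for j in range(n_feat):
        pos = [x[j] for x, t in zip(X, y) if t]
        neg = [x[j] for x, t in zip(X, y) if not t]
        scores.append(abs(sum(pos) / len(pos) - sum(neg) / len(neg)))
    return sorted(range(n_feat), key=lambda j: -scores[j])

def nearest_centroid_predict(X, y, x_new, feats):
    """Predict True/False for x_new from class centroids over the chosen features."""
    def centroid(cls):
        rows = [x for x, t in zip(X, y) if t == cls]
        return [sum(r[j] for r in rows) / len(rows) for j in feats]
    def sq_dist(c):
        return sum((x_new[j] - cj) ** 2 for j, cj in zip(feats, c))
    return sq_dist(centroid(True)) < sq_dist(centroid(False))

def loo_error(X, y, k):
    """Leave-one-out error rate when keeping the top-k ranked features."""
    errors = 0
    for i in range(len(X)):
        X_tr, y_tr = X[:i] + X[i + 1:], y[:i] + y[i + 1:]
        feats = rank_features(X_tr, y_tr)[:k]  # ranking sees the training fold only
        if nearest_centroid_predict(X_tr, y_tr, X[i], feats) != y[i]:
            errors += 1
    return errors / len(X)

# Toy data (made up): feature 0 separates the classes, feature 1 is noise.
X = [[1.0, 5.0], [1.2, 1.0], [0.9, 3.0], [3.0, 4.0], [3.1, 0.5], [2.9, 2.5]]
y = [True, True, True, False, False, False]

# Choose k from the leave-one-out estimate; the test set is never touched.
errors_by_k = {k: loo_error(X, y, k) for k in (1, 2)}
best_k = min(errors_by_k, key=errors_by_k.get)
```

The held-out test set can then be used once, with the chosen k, for the final error estimate.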
Copyright ©1999-2008 Internicity Inc. All rights reserved.