QuickTopic (SM) free message boards QuickTopic (SM) free message boards
Skip to Messages
  Sign In to access your topic list  |New Topic |My Topics|Profile
Upgrade to Pro   Customize, show pictures, add an intro, and more:   QuickTopic Pro...and check out QuickThreadSM
Topic: Feature selection for high-dimensional genomic microarray data
Views: 357, Unique: 246 
Subscribers: 0
What's
this?
Printer-Friendly Page
Subscribe to get & post, or stop messages by email Subscribe
All messages            1-11 of 11        
About these ads
Who | When
Messagessort recent-top   
Post a new message
 
Dave KauchakPerson was signed in when posted  1
04-29-2002 06:07 PM ET (US)
I am extremely impressed at authors' ability to cram so much information into such a short paper. However, I think due to it's density, there were quite a few details that were assumed to be known or simply left out. So, here are a list of questions/comments that I had:

In section 2.2, what is a reference partition?

Why did the authors decide to use k=3 for the k-nn classifier?

For the unconditional mixture modeling, do you simply threshold the mixture overlap probability? If so, how is this threshold chosen? The authors seem to ignor this in the description and the experimental section.

I was a bit disappointed with the experimental section. I found it a bit difficult at times to understand exactly what was happening. It doesn't appear to me that the authors used cross-validation in their experiments. If they didn't, why didn't they? In the first set of experiments seen in Figure 2, are the feature selection methods applied sequentially or are those results of the algorithms run independently? The experiment in section 5.2 is a bit confusing for me. Although I think it is interesting to show how the classifiers perform with a consistent feature set, I think it also would have been interesting to see how the classifiers interact with the classifier dependent feature selection algorithms.

Finally, given that one of the motivations for applying filtering methods initially to narrow the feature selection problem is to minimize computation time, I think the paper should have presented some notion of computational complexity improvement even if it was only in the form of experimental observation.

Sorry for the long post... I had a few questions :)
Aldebaro  2
04-30-2002 03:06 AM ET (US)
Edited by author 04-30-2002 05:14 AM
The paper is very interesting but I agree with Dave that the writing could be better.

Section 2.2 presents a quite confusing explanation about information gain. Let me try to make things worse...

Information gain is the same as mutual information. Given a bunch of examples (f,y), if one picks randomly an example and asks the value of label y, the uncertainty about the random variable (r.v.) Y can be measured by the entropy H(Y). If the value of the feature fi of instance f is revealed, it may potentially bring some useful information about Y or not. Given we know the values of the r.v. Fi, the remaining entropy of Y is H(Y/Fi). The information gain of Fi is
I(Y;Fi) = H(Y) - H(Y/Fi).

The bigger I(Y;Fi), the better, like in Fig 2(b). Given that H(Y) is fixed here, the goal is to find Fi that minimizes the uncertainty H(Y/Fi). The authors deal only with 2 classes: Y=true or Y=false. So, H(Y) is at most 1 bit. If the i-th feature can by itself determine the correspondent label y, then H(Y/Fi)=0 (no uncertainty left) and I(Y;Fi)=1 bit.

The authors present the information gain in a general setup, like when it's used in a decision tree where, in a given node, there are examples from C different classes (that correspond to a partition of the input space into C sets) and the value of feature fi will split these examples into K sets. In the paper, there are only C=2 classes. The "reference partition" is the division of the training data into 2 sets according to the 2 possible Leukemia labels (corresponding to trying to split always the root node of a tree).

Also, a minor detail: it doesn't sound right to say (last paragraph of 2.2) that quantization is required to calculate I(Y;Fi). There is a definition of entropy for continuous random variables in (Cover & Thomas, 91).

Too much about section 2.2... I am curious about the reason for having so many genes with "error" equal to 1 in Fig. 2(a). Maybe because I didn't find out how they really calculated Eq. (1) (integration of Gaussian pdf's?)

I think Fig. 2 simply shows the behavior of the ranking methodology, but there are no classifiers involved. There is a standard partition into training and test set and I guess that's what the authors used for Fig. 3. It's not fair to choose the number of features based on plots for the test set in Fig. 3, so the authors used cross-validation (leave-one-out) in Fig. 4 and 5.
Kristin Branson  3
04-30-2002 03:06 PM ET (US)
Dave, from my understanding, a reference partition is just a partition of the data into its classes. So S_1 is the data with label 1 and S_C is the data with label C.

I think the authors did use cross-validation of the training data to determine the optimal number of features to select, as shown in Figure 4. In terms of using cross-validation to test the algorithm, I can only assume that this is what is used. In particular, cross-validation is the only way to go if the number of training samples is limited, and the precision of their error measures suggest that they tested on a large number of samples. You can tell that they didnt test on the training set because of the dotted line graphs in Figure 3.

I am looking forward to today's presentation to clarify the different feature selection algorithms used. I'm not familiar with these algorithms, and am interested in learning about them. I'm not sure how creative the authors were in deciding to pile all these algorithms on top of each other (seems a bit ad hoc), but their justifications for the order in which they were applied is somewhat satisfying (at least for the Markov Blanket Filtering). I agree with Dave that the paper was very dense and full of information, making it hard for a novice like me to understand the algorithms.

I am confused about the specifics of the gene application, and was wondering if someone could explain it to me. From the introduction, I am guessing that a feature is a gene, and that there are 7130 genes collected from a single sample. What is a sample? Is it one human, for example? And what is an expression level?

Looking at Figure 3, it seems weird to me that under 10 featuers give optimal performance for logistic regression, while over 40 are required for KNN. Is there a property of KNN/logistic regression that explains this?

Overall, I think this paper is a good paper to present for this class, as it goes over a number of interesting algorithms. I think the paper would have benefited by losing the section on regularization methods, as it doesn't really fit with the rest of the paper. Then, there would have been more room for explanation.

I thought the experimental section was actually pretty good -- it had all the comparisons I wanted to see, though I agree that it was a bit vague about the details of the experiments. I also like that the authors tried different classification algorithms, as the interaction between classification algorithm and feature selection method may be hard to predict.
Gyozo Gidofalvi  4
04-30-2002 03:08 PM ET (US)
I also liked the ideas presented in the paper although i think there were too many. To me it seemed that there was not a single idea on which the paper was focusing.

Feature selection in itself is a very important and interesting task. I think that the paper showed one particular way of performing feature selection.

I think the general result that "feature selection methods can be applied to classification problems, [where there are only a few data points and those data points are high dimensional]" is not surprising at all.

Also, i found the comparison to classification results using a random subset of features somewhat unfair. I think it would be interesting so see how classifiers such as SVMs would perform on the high dimensional data set, or on the dataset, which is obtained after applying traditional dimensionality reduction techniques.
Greg HamerlyPerson was signed in when posted  5
04-30-2002 03:22 PM ET (US)
This was a good paper, and well-written. I really liked the idea of section 2.1, where they define a probability of mixture overlap (and thus of discriminative power).

I think figure 2 is confusing and misleading. My understanding is that the authors would choose features first based on epsilon [figure 2(a)], then from that subset they would use information gain [2(b)], then finally from that subset use the Markov blanket. However, in 2(a) they only show genes which have epsilon < 0.5, which is about 4500 genes. But then in 2(b) they show information gain for apparently all of the data. From 2(b) to 2(c) they seem to be consistent and take only 360 genes filtered by information gain.

I agree with Victor about the comparisons with other methods. I am not familiar with what is state of the art for gene expression classification, but perhaps SVMs or some other approach would be nice to see.

Overall, good paper.
Eugene Ke  6
04-30-2002 04:40 PM ET (US)
Greg,

Here is a web site with many links to papers that use SVMs in Computational Biology. http://www.support-vector.net/bioinformatics.html

There is a recent paper that is very similar to Degui's and one he recommened to me, Gene Selection for Cancer Classification using Support Vector Machines, I. Guyon, J. Weston, S. Barnhill and V. Vapnik, Machine Learning 46(1/3): 389-422, January 2002.

The current link on that page is down, but if you post your email address and I send you the pdf.
Eugene Ke  7
04-30-2002 05:04 PM ET (US)
Kristin,

Each feature is a gene, and each patient is a sample. If you have time, reference the Golub paper at http://www.sciencemag.org/cgi/content/full/286/5439/531. It has a very good description of the experimental data.

Expression level is more hard to explain, but I'm sure Degui will go into DNA Microarrays into much more detail. Sorry, but he is much more knowledgable than I am.
Eric Wiewiora  8
04-30-2002 05:30 PM ET (US)
To answer Kristin's question about the samples, I looked at the introduction to section five. Sample were taken from 78 leukemia patents, and the task was to classify the leukemia as type I or type II. Apparently this classification can be made by the symptoms at some point in the disease, but the task is to identify the type earlier based on gene expression. Gene expression is a measure of whether a certian segment of DNA is currently being used in the process of making proteins. Most of a cell's DNA is dormant at any given time. I do not know how they check for gene expression, but I suspect it is measured by the RNA in the cells. RNA is a modified version of DNA that is directly involved in protein construction.

Anyone in bioinformatics care to confirm/deny my analysis?
Aldebaro  9
04-30-2002 05:54 PM ET (US)
About using cross-validation or a test set. I think they used both. When describing Figure 3, they mention "in actual diagnostic practice we do not have the test set available, so these number are optimistic". That's why I think Fig. 3 uses the standard partition of training and test set mentioned in the beginning of section 5, while Figs. 4 and 5 use leave-one-out.
Aldebaro  10
04-30-2002 08:32 PM ET (US)
Eugene, my email is a.klautau
at (hope the spammers don't parse it):
ieee.org

It would be nice if you could send me the paper and I hope you check this page :)
Eugene Ke  11
05-01-2002 01:21 AM ET (US)
Eric,

Your analysis is very good and largely correct. If you reference the Golub paper, you'll understand why accurate classification is important. The short summary is that the two types of leukemia, ALL and AML, have very similar physiological symptoms. However, while the symptoms appear the same, they are associated with different cells, and the methods of treating them dissimilar. As stated in Golub, "Although remissions [cure] can be achieved using ALL threapy for AML (and vice versa), cure rates are markedly dimished, and unwarranted toxicites are encountered."

Unforunately, in order to correctly classify the two types you have to run all kinds of test and rely on expert judgement. So, a quick and dirty initial test would be ideal in helping with classification (less time waiting, more effective treatment, etc...). That, and a more quantative measure would be nice, as "expert" opinion can be terribly noisy.

There are many examples similar to this one, and if you're interested in this kind of work, the Pharmacology Department here on campus is great. Just ask around about Pharmacogenetics/Pharmacogenomics.

To measure gene expression in a high-throughput way, that is where the DNA microarrays use. Basically, they use lithography techniques to spot a chip with DNA/RNA probes. They extract all the RNA from a cell, and dump it onto the chip. If the RNA binds to a probe, exactly which protein it would express. Of course, that is very watered down and if you want more detail, reference the Shalon paper or talk to Dr. Michael Heller in the Bioengineering department. Lot of active research too, as the technique is very very new.

Little more background, according to the Central Dogma, DNA is converted to mRNA (messenger RNA) in the nucleus, mRNA travels outside the nucleus, where it is translated by the ribosomes which then construct proteins. For more information, see Alberts, the Cell.
RSS link What's this?
All messages            1-11 of 11        
QuickTopicSM message boards
Over 200,000 topics served
Learn more Frequently asked questions  Acknowledgements
What they're saying about QuickTopic
 Questions, comments, or suggestions? Contact Us
Read our use policy before beginning. We value your privacy; please read our privacy statement.
Copyright ©1999-2008 Internicity Inc. All rights reserved.