| Who | When |
Messages | |
(not accepting new messages)
|
|
| |
Messages 161-159 deleted by topic administrator 07-15-2009 02:08 AM |
Jerry Fu
|
158
|
 |
|
12-14-2007 06:57 AM ET (US)
|
|
I don't know if anyone is still reading this, but I discovered an interesting "trick" in MATLAB. I'm sure most of you found that computing the feature function values took quite a while. So I decided to try splitting up the training set, so that I calculated all the features for 500 examples at a time (passing only the 500 examples into the function), and just appended the calculated arrays. This was blazing fast compared to calculating the entire set at once...so that's something to try if you have this problem in the future.
|
| Charles Elkan
|
157
|
 |
|
12-13-2007 12:59 PM ET (US)
|
|
I did not intend to advocate Matlab strongly over R. Personally I happen to be more accustomed to Matlab. AFAIK Matlab has a wider range of users and of libraries for all of science and engineering, including machine learning. On the ther hand R is more elegant and more widely used by the best researchers in statistics. I'm sure there are other important differences also.
One important question is which handles large datasets better. I don't know the answer to that.
On Thu, 13 Dec 2007, QT - Klinton Bicknell wrote:
> < replied-to message removed by QT >
|
| Klinton Bicknell
|
156
|
 |
|
12-13-2007 12:41 PM ET (US)
|
|
At the beginning of the quarter, you said in the response to one of the messages on this board that R and Matlab had complementary advantages, and it was good to know both. Yet, I think all the projects we've done have strongly recommended using matlab. I was just curious what you saw as the relative strengths and weaknesses of the two. (Obviously, this is a low-priority question, though =)
|
| Charles Elkan
|
155
|
 |
|
12-12-2007 09:57 AM ET (US)
|
|
/m154: Yes, Viterbi assumes the weights are given. However, these weights are updated by each step of the perceptron. In each epoch through the training data, when you run Viterbi on example i+1, you need to use the weights updated based on example i. Running Viterbi on every example only at the beginning of each epoch might conceivably work. This would be an interesting experiment to try. To get a useful answer, you would have to adjust the learning rate separately for each method.
|
Jerry Fu
|
154
|
 |
|
12-12-2007 03:32 AM ET (US)
|
|
I am a bit confused now. Aren't the weights set before you ever run Viterbi? so you would go through the entire dataset using perceptron to train weights, then use Viterbi to determine the labellings?
|
| Charles Elkan
|
153
|
 |
|
12-12-2007 02:25 AM ET (US)
|
|
/m152: You are right, there are no thresholds in the Viterbi algorithm. Generally you might find some way of tuning the training and/or prediction algorithm to produce overall more or fewer hyphens. But I can't say I know a good way to do this. More generally, per-letter accuracy, per-word accuracy, and log-likelihood are all different objective functions. A research question is how to tune CRF-type methods in order to maximize (on test data!) whatever objectove makes most sense for the application.
|
| Klinton Bicknell
|
152
|
 |
|
12-12-2007 01:50 AM ET (US)
|
|
I don't understand the question about tuning numerical thresholds for predictions. I'm not sure what part of Viterbi could have a threshold.
|
| Charles Elkan
|
151
|
 |
|
12-12-2007 01:38 AM ET (US)
|
|
for time complexity reasons, we were computing the Viterbi algorithm for all the words together and storing it in a matrix of size O(m*n*numwords).
If you are using the perceptron method (or stochastic gradient) this is not a good idea, because the weight vector changes after processing each word. So you have to run Viterbi each word at a time.
we have to compute four more f matrices (one for 00, 01, 10, 11) which are each (numwords*lengthOfLongestWord*numFeatures) in size. These matrices can be stored as 'sparse' in Matlab, because most feature-functions are 0, for every position of every word.
|
| Kristen Jaskie
|
150
|
 |
|
12-12-2007 01:31 AM ET (US)
|
|
The Viterbi algorithm only needs O(m*n) space per word, but for time complexity reasons, we were computing the Viterbi algorithm for all the words together and storing it in a matrix of size O(m*n*numwords). Also, it's necessary to compute g which is of size(numwords*lengthOfLongestWord*4) and to do that, we have to compute four more f matrices (one for 00, 01, 10, 11) which are each (numwords*lengthOfLongestWord*numFeatures) in size. All together there are ton of data structures that have to be stored even to compute U. Unless we are totally confused and doing something wrong...
|
| Charles Elkan
|
149
|
 |
|
12-12-2007 01:30 AM ET (US)
|
|
What would be an "acceptable" number of feature functions to have? I'm asking because I'm not sure we will have time to put in thousands or even hundreds of functions for the final project. We're currently working with a small number of feature functions as we develop and test our algorithms, but we expect to expand beyond those that we currently have, although it might be difficult for us to get more than 50 or so. Another concern here is how well the project would work without that many feature functions, and I guess I'll find that out once we're able to start testing this.
I think for good performance you need at least a few hundred feature functions. However, I will be happy if you show that I am wrong!
This project is an educational experience, so if you do everything well except for some reason you have under 100 feature functions, you will still get a very good score. In your report, do explain briefly why you have not many feature functions, and what the bottleneck(s) is/are preventing you from having more.
If you do have 100s or 1000s of feature functions, then experiments using different subsets of these will be very interesting.
|
| Charles Elkan
|
148
|
 |
|
12-12-2007 01:24 AM ET (US)
|
|
Edited by author 12-12-2007 01:25 AM
/m144 answer: I am surprised because the Viterbi algorithm only needs O(m*n) space where m is the number of tags and n is the length of the word. With m <= 4 and n <= 10 this should not be a problem! If you can run ok with training based on 100 words but not 500, it seems you are not freeing up memory after each processing each word. Probably you need to find and fix this memory leak. One way to clean up your code to reduce memory usage is to encapsulate the body of an outer loop into a function. This can help because all memory used by local variables of the function is released every time you return from the function. Have you used the Matlab profiler to see where your code is slow? That may be where the memory leak is. Unfortunately the profiler does not directly tell you about memory usage. You may find useful hints here: http://www.mathworks.com/support/tech-notes/1100/1106.html
|
| Kristen Jaskie
|
147
|
 |
|
12-12-2007 01:20 AM ET (US)
|
|
We were using over a thousand but we cut out the ones that didn't seem very useful during preliminary testing so now we are only using 740.
|
| Jerry Fu
|
146
|
 |
|
12-12-2007 01:17 AM ET (US)
|
|
how many feature functions are you using?
|
| Sean Park
|
145
|
 |
|
12-11-2007 11:51 PM ET (US)
|
|
Same thing is happening to us. Do not really know what to do as well.
|
| Kristen Jaskie
|
144
|
 |
|
12-11-2007 11:45 PM ET (US)
|
|
My partner and I have been working super hard to finish this project early because we both have other deadlines and exams this week. We finished the code (using the Viterbi algorithm) this weekend but have spent the two days since trying to run the experiments. Our problem is that we keep running out of memory (even on my partner's brand new laptop optimized for performance with enhanced virtual memory in the control panel). We tried cutting down our number of feature functions to a minimum and performing garbage collection wherever possible. But while we can run the code just fine if we use 1% of the data for training, attempting to use even 5% causes the program to fail due to lack of memory. We have spent WAY too much time on this problem and are unsure how to proceed. We obveously cannot do 5-way crossvalidation. Should we just attempt 100-way crossvalidation (using 1% at a time to train and the rest to test?) We started early but we're down to the nub now and don't know what to do. Any suggestions?
|
| Charles Elkan
|
143
|
 |
|
12-11-2007 03:11 PM ET (US)
|
|
Some additional notes on what to put in your report
Make sure you yourself, and the reader, are convinced that your algorithms and your implementation are correct. We want to be able to believe and trust your experimental results.
In the report, make it very clear exactly what feature-functins you use. For each different type of feature-function, explain it explicitly and say how many feature-functions there are of this type.
Examine and discuss your final trained CRF model. Tabulate which feature-functions have the largest learned weights (positive or negative). Use your intuition to explain, in retrospect, why these feature-functions have the most predictive power.
|
| Charles Elkan
|
142
|
 |
|
12-11-2007 12:15 AM ET (US)
|
|
a feature function that examines only vowel consonant patterns and hyphenates a particular letter in that sequence does not change based on the previous tag. so when performing the Viterbi algorithm, the value of this feature function would be multiplied by its weight and added in , correct?
Yes.
If we have a feature function that examines the previous tag, we can precalculate all possible tag combinations in preprocessing, and then operate on these vectors instead of calculating these on the fly. Yes.
|
| Jerry Fu
|
141
|
 |
|
12-10-2007 11:30 PM ET (US)
|
|
I have a questions about feature functions, specifically how the lecture notes example pertains to our project. the function U(k,v) maximizes over the tags y at each step, which are either 1 or 0 in our project. However, a feature function that examines only vowel consonant patterns and hyphenates a particular letter in that sequence does not change based on the previous tag. so when performing the Viterbi algorithm, the value of this feature function would be multiplied by its weight and added in , correct?
If we have a feature function that examines the previous tag, we can precalculate all possible tag combinations in preprocessing, and then operate on these vectors instead of calculating these on the fly.
|
| Charles Elkan
|
140
|
 |
|
12-09-2007 05:25 PM ET (US)
|
|
Reminder: The final exam will be on Thursday at 3pm, in WLH 2208, which is the same room that lectures were in.
|
| Charles Elkan
|
139
|
 |
|
12-09-2007 12:05 PM ET (US)
|
|
Project simplifications
Here are some suggestions and simplifications that should make the project easier.
(1) Implement perceptron training first. With this method, you do not need to compute the partition function, and you do not need to compute derivatives. All you need is the argmax.
(2) If you assume that no hyphen is allowed after the first letter or before the last letter, then you don't need START and STOP tags, so the cardinality of the tag set is just 2. With this cardinality, you can compute the argmax, Z, and derivatives all by brute force (i.e. enumerating all possible tag sequences). Implement this first and use it to do the project. Then, if you have time, use it to check the correctness of your code that uses sophisticated algorithms.
(3) Your priorities should be clearness, correctness, and usefulness. A clear, correct, useful method is far more desirable than a fancy method whose correctness is not clear and whose results are questionable.
(4) There is an 80-20 rule in everything. Best performance might come from 10,000 feature-functions but you can get almost as good performance with many fewer. Start with few enough so that your code is fast enough that you can debug it easily.
(5) Perception is reality. First impressions matter. Presentation counts. A good report about a good method is better than a bad report about a method that might be very good.
(6) Many problems are too hard to solve, except with simplifications. Having the judgment to invent potential simplifications, and then to select reasonable ones, is a very important skill.
In summary, I will be happy to give full marks to a good report about a method that clearly does solve the problem, even if the method itself is simplified. Specifically, you can get full marks based on just just one training method.
|
| Charles Elkan
|
138
|
 |
|
12-08-2007 06:33 PM ET (US)
|
|
I appreciate that the last project is the most complex one, and that many people have multiple other assignments due also. I will certainly be generous in scoring. Let me repeat also that scoring is not grading, so lowish scores can still lead to a good grade. The projects and exams are challenging!
I have been covering log-linear models and CRFs since November 27. I understand that many people may not have been able to focus on the project until now. I will try to be very available to answer questions on this message board.
|
| Jostein Oeysaed
|
137
|
 |
|
12-08-2007 04:57 PM ET (US)
|
|
Hi Charles,
I'm feeling kind of frustrated because this last project seems much more difficult than the previous ones, and you didn't go over the stuff we needed to know until Thursday. My partner and I started early and were trying to do it before, but haven't made any progress till now since you hadn't gone over what we needed to know. We still have a LONG way to go (it's pretty confusing stuff) and now we only have a couple of days to do the entire thing in. The problem is that I have several other finals and projects next week, so I just don't know if we'll have time to do a good job AND to study for my other classes.
I think the week we had off due to the fires messed up your schedule, but I don't think it's fair for you to make us work like this during finals week. We really didn't have the material that we needed to do the project until the very last day of class.
It would help if you could be a little bit easier in grading this project, due to the lack of time we have to do it (like how you "graded" the first project :)
|
| Charles Elkan
|
136
|
 |
|
12-07-2007 01:38 AM ET (US)
|
|
|
| Charles Elkan
|
135
|
 |
|
12-05-2007 10:24 PM ET (US)
|
|
Edited by author 12-05-2007 10:25 PM
how to represent feature-functions in matlab
Given fixed xbar and ybar, each F_i value is a number. So you can represent all F_i as one vector for xbar and ybar. This vector is sparse since F_i(xbar,ybar) = 0 for most i. The F_i vector can be computed once and for all for each training example, before you start training.
How you compute the F_i vectors, i.e. how you go from definitions like f(y_i-1, y_i, xbar, i) = I(y_i = 1)*I(<x_i-1 x_i x_i+1 x_i+1> = vowel-consonant-vowel-consonant) to the numerical values, is a separate question. You may want to do this preprocessing in a scripting language.
|
| Charles Elkan
|
134
|
 |
|
12-05-2007 10:12 PM ET (US)
|
|
m is "... the cardinality of the set of labels." So, for this project it should be 2 since the each label of a character is binary, correct?. This creates 2x2 matrices for the multiplications
Yes, because there are just two labels: 0 for no hyphen, and 1 for hyphen. If you also have labels "begin" and "end" then m=4.
I don't quite think my understanding of this is correct since for a feature function with different arity like g_t(y_{i-1}, y_i, y_{i+1}), this doesn't work. How are different arity feature functions dealt with, and what is the correct value for m?
The arity has nothing to do with m, the number of different label values possible at a single position. The restriction to arity = 2 is fundamental to the CRFs we are studying. The reason is precisely that the Viterbi and matrix algorithms only apply when arity = 2.
|
| Kristen Jaskie
|
133
|
 |
|
12-05-2007 05:32 PM ET (US)
|
|
Deleted by author 12-05-2007 08:19 PM
|
| Stefano
|
132
|
 |
|
12-05-2007 03:14 PM ET (US)
|
|
I have a question about computing the partition function.
The matrix multiplications of M_t(v,w), are for n different m-x-m matrices. m is "... the cardinality of the set of labels." So, for this project it should be 2 since the each label of a character is binary, correct?. This creates 2x2 matrices for the multiplications, which seem to work out for the definition as g_t(y_{i-1},y_i).
I don't quite think my understanding of this is correct since for a feature function with different arity like g_t(y_{i-1}, y_i, y_{i+1}), this doesn't work. How are different arity feature functions dealt with, and what is the correct value for m?
|
| Charles Elkan
|
131
|
 |
|
12-04-2007 01:40 PM ET (US)
|
|
Edited by author 12-04-2007 01:41 PM
I'm a bit confused about what function to use with Carl Rasmussen's non-linear optimizer. The example shows it being used with the Rosenbrock function which returns the function value, the partial derivatives, and the Hessian of the (general dimension) function. I'm guessing however that we only need to supply it with a function that returns the function value and the partial derivatives, right?
Correct. Use checkgrad.m to verify that the partial derivatives you supply are correct.
Then what might this function be? ... should we write one? Perhaps one that calculates the log likelihood? But what would the partial derivatives be?
Yes, you should supply the log-likelihood as your function to maximize. The partial derivatives are given in my lecture notes on page 7.
|
| Matt Jacobsen
|
130
|
 |
|
12-04-2007 01:57 AM ET (US)
|
|
I'm a bit confused about what function to use with Carl Rasmussen's non-linear optimizer. The example shows it being used with the Rosenbrock function which returns the function value, the partial derivatives, and the Hessian of the (general dimension) function. I'm guessing however that we only need to supply it with a function that returns the function value and the partial derivatives, right?
Then what might this function be? Would the Resenbrock function be appropriate? Or should we write one? Perhaps one that calculates the log likelihood? But what would the partial derivatives be?
I'm a bit lost on this. Any help would be appreciated.
Thanks, Matt
|
| Charles Elkan
|
129
|
 |
|
12-03-2007 10:58 PM ET (US)
|
|
<i> By placing hyphens in novel english words correctly do you mean breaking the word into its syllables? <\i> Generally a hyphen is only allowed between syllables, but some syllable breaks are not desirable hyphen places. For example, you don't want a hyphen after the first letter of "asymptomatic."
The task here is shallow: just to learn to predict hyphen locations as they are found in the training data. Syllabification is a deeper, related, task.
|
| Matt Rodriguez
|
128
|
 |
|
12-03-2007 09:21 PM ET (US)
|
|
Quick question about the project
By placing hyphens in novel english words correctly do you mean breaking the word into its syllables?
Thanks, Matt
|
| Charles Elkan
|
127
|
 |
|
12-03-2007 03:37 PM ET (US)
|
|
Are you going to be covering the Viterbi algorithm and matrix multiplication algorithm in class? The paper is kind of confusing. In particular, we are having trouble understanding U(k,v).
Yes, these two algorithms will be the main topic tomorrow, Tuesday. U(k,v) is an entry in a 2D matrix. k is an integer between 1 and the length of ybar. v is one of the possible values for y_k. Each value U(k,v) is computed based on the values in the previous row, which is U(k-1,:).
Note the typo. in the first equation on page 5 of the notes: y_k - 1 should be y_{k-1}.
|
| Kristen Jaskie
|
126
|
 |
|
12-03-2007 02:59 PM ET (US)
|
|
Are you going to be covering the Viterbi algorithm and matrix multiplication algorithm in class? The paper is kind of confusing. In particular, we are having trouble understanding U(k,v).
|
| Charles Elkan
|
125
|
 |
|
11-30-2007 04:42 PM ET (US)
|
|
|
| Charles Elkan
|
124
|
 |
|
11-27-2007 10:35 AM ET (US)
|
|
1) When we tried T = 20 and 10, the process takes only a few iterations to converge (i.e. the euclidean distance between subsequent thetas (or f) array falls below a threshold of 10^-6). Usually, if we follow the 20 or 10 with 8, 6, or 5, the EM runs many iterations before converging (several hundred to 1000). While it's iterating, the E(LL) is always increasing, but the change in theta usually starts out decreasing, but then starts to increase before falling below the threshold of 10^-6. Then suddenly, it'll just fall off. After that, the remaining values of T all converge quickly (< 100 iterations).
I am not surprised by the above. When T is very high (what this is is problem-dependent) then the LL surface is vey smooth and convergence is fast.
We are not supposed to do anything to theta between T values right? we don't have to scale or multiply them by the root that we were taking it at do we.
Correct.
2) the theta that we get frrom deterministic annealing has the same probability distribution at each position across all letters.
This may be because you started with T too high and the LL surface was flat.
Should be be stopping iterations once the change in theta increases for a given value of T? I'm wondering if maybe we let EM run "too far."
No.
UPDATE: I tried T = [2 1], and we seemed to get reasonable values for theta when I let that run to convergence.
Good!
|
| Jerry Fu
|
123
|
 |
|
11-27-2007 02:33 AM ET (US)
|
|
Edited by author 11-27-2007 02:41 AM
When we run our EM algorithm with deterministic annealing, we're noticing a strange (possibly) things.
1) When we tried T = 20 and 10, the process takes only a few iterations to converge (i.e. the euclidean distance between subsequent thetas (or f) array falls below a threshold of 10^-6). Usually, if we follow the 20 or 10 with 8, 6, or 5, the EM runs many iterations before converging (several hundred to 1000). While it's iterating, the E(LL) is always increasing, but the change in theta usually starts out decreasing, but then starts to increase before falling below the threshold of 10^-6. Then suddenly, it'll just fall off. After that, the remaining values of T all converge quickly (< 100 iterations).
We are not supposed to do anything to theta between T values right? we don't have to scale or multiply them by the root that we were taking it at do we.
2) the theta that we get frrom deterministic annealing has the same probability distribution at each position across all letters.
Should be be stopping iterations once the change in theta increases for a given value of T? I'm wondering if maybe we let EM run "too far."
UPDATE: I tried T = [2 1], and we seemed to get reasonable values for theta when I let that run to convergence.
|
| Charles Elkan
|
122
|
 |
|
11-26-2007 03:42 PM ET (US)
|
|
How many local optima can there be in terms of N number of training examples, W length of subsequence, and L number of letters?
I'm pretty sure the answer is an exponential number. It has been proved that with a certain formalization of the task, finding the globally optimal motif is NP-hard.
|
| Charles Elkan
|
121
|
 |
|
11-26-2007 03:40 PM ET (US)
|
|
Edited by author 11-26-2007 03:40 PM
what does it mean to get Z_i1 > 0.50 for a particular protein?
Z values are not defined for whole proteins. They are just defined for subsequences. Z(protein i, position j) > 0.5 means over 50% chance that *this* subsequence is a motif instance.
Lastly, the project description says that we should implement a streamlined version of MEME that fixes any mathematical errors. I've been busy implementing what is described, although I've left out some thing like erasing that we don't need. But does anyone have suggestions or hints on what needs to be fixed with the original implementation?
"Streamlined" means only one of OOPS, ZOOPS, etc., no multiple motifs, no automatic width determination, etc. "fixes any mathematical errors" just means beware that some notation or other details may be wrong in the original paper, just as it may be wrong in any published paper. Our (my) understanding of EM is better now than it was in 1996 (but still imperfect!).
|
| Charles Elkan
|
120
|
 |
|
11-26-2007 03:34 PM ET (US)
|
|
Are we supposed to normalize the Z_i1 vector for all subsequences of length W such that the sum over each subsequence is <= 1? This is described as the z_ij matrix (small z matrix) in the paper.
You do not need to do this normalization. It is important in practice for DNA motifs. It is less important for protein motifs. You do not need to do it now partly because it is just a heuristic. Some recent research has provided an alternative approach to avoiding motifs that self-overlap that is more mathematically justified, but that is far outside the scope of this project.
|
| Charles Elkan
|
119
|
 |
|
11-26-2007 03:31 PM ET (US)
|
|
Edited by author 11-26-2007 03:31 PM
The log-likelihood is slightly worse, at the end, compared to not using DA. Using DA and MEME's theta1, value is -65223. Not using DA, but using MEME's theta1, value is -65072. Using our own initial theta1 with and without DA, our values are the same, -65015.
There is no guarantee that any particular DA schedule is useful for any particular task, so these results don't necessarily need an explanation.
When starting with our initial theta1, the log-likelihood values are always increasing with every iteration. However, when starting with MEME's theta1 this is not true. During the t=10 stage, the first 6 iterations actually decrease the log-likelihood (approaching -65223). If I didn't see that our own theta1 + DA experiment had always increased log-likelihood, I'd expect this could happen because of the soft assignments.
I think it is possible that the extra-soft assignments used by DA may cause the log-likelihood not to increase at every iteration. This phenomenon could show up for some uses of DA but not others.
But now I'm very confused. Why would log-likelihood decrease when run using MEME's theta1?
DA is one candidate explanation. Another may be that standard EM is only guaranteed to converge to a saddle point, not necessarily to a local maximum. Unfortunately I don't know any good readable full analysis of EM's convergence properties. This is frustrating because without a full theoretical analysis, it's hard to know whether the code might have a bug.
|
| Matt Rodriguez
|
118
|
 |
|
11-26-2007 03:03 PM ET (US)
|
|
How many local optima can there be in terms of N number of training examples,W length of subsequence, and L number of letters? Can this be determined?
I know in a polynomial of root n there can be n -1 local optima, but this is a lot more complicated than that.
Cheers
|
| Jerry Fu
|
117
|
 |
|
11-26-2007 02:21 PM ET (US)
|
|
This was mentioned earlier in the board, but what does it mean to get Z_i1 > 0.50 for a particular protein? Does it mean that some subsequence of length W from that protein had a value > 0.5?
Also, do we need to look at value from the Z matrix? I suppose it is one way of figuring out which subsequence belong to the motif distribution, but this information is factored into theta already (except that theta only gives a probability distribution of letters and positions).
Speaking of theta, I know people have said that the values are not as extreme as MEME, but in my theta, the highest values that I am getting are about 0.2 (for any given letter in a given position)...that doesn't seem very extreme at all.
Lastly, the project description says that we should implement a streamlined version of MEME that fixes any mathematical errors. I've been busy implementing what is described, although I've left out some thing like erasing that we don't need. But does anyone have suggestions or hints on what needs to be fixed with the original implementation? The one thing I see so far changing is not using the z_ij as described in the paper, because it ends up being pretty sparse, and just working with a vector of the subsequences. This doesn't really streamline too much though, besides not needing to create an extra matrix.
|
| Jerry Fu
|
116
|
 |
|
11-26-2007 02:01 PM ET (US)
|
|
"Are we supposed to normalize the Z_i1 vector for all subsequences of length W such that the sum over each subsequence is <= 1? This is described as the z_ij matrix (small z matrix) in the paper. Are we supposed to use this z_ij matrix (small z) during the M-step instead of the Z_ij (big Z) matrix for j=1,2 and i=1..n?
In the paper this is done to prevent the EM from converging to simple motifs, but this then breaks the property making each a probability, Z_i1 + Z_i2 = 1 (unless we're supposed to re-scale Z_i2 as well?) Or is there some other way to prevent convergence to simple motifs?
I had thought we did it correctly, until I re-read this part. Could someone clarify? Thanks!"
I think we are supposed to normalize it. But one catch is that the way this is described in the paper, it only normalizes over zij that are on the same line (i.e. from the same protein). My program has all the subsequences of length W stored in a vector like you mentioned (referred to as X in the paper), but I also create the matrix z_ij as described in the paper using the vector X and the lengths of each protein in the dataset Y. The thing is, I use z_ij only to normalize...just about everything else is performed using X, so I think I could probably write a loop to normalize X directly. I think the loops would have to use the lengths though, because in the array X you can't tell when you've wrapped around to a new line of z_ij (i.e. a new protein).
As for Zi2, I though we can get away without normalizing it, because I thought it is only used only for the recalculating f0...but I just noticed it is also being used for lambda. so maybe we do need to normalize that one also.
|
| Matt Jacobsen
|
115
|
 |
|
11-26-2007 12:29 PM ET (US)
|
|
<i.More alarmingly, when I start with MEME's "letter-probability matrix" as my initial theta1, but run EM with a 10,1 deterministic annealing T schedule, I get a Z_ij matrix that attributes absolutely all subsequences to the background model (i.e. all Z_i1 are 0 and Z_i2 are 1). This is even more confusing.
Yes, this is strange. Is the log-likelihood increasing at every iteration; Is the final log-likelihood better than without DA?</i>
The log-likelihood is slightly worse, at the end, compared to not using DA. Using DA and MEME's theta1, value is -65223. Not using DA, but using MEME's theta1, value is -65072. Using our own initial theta1 with and without DA, our values are the same, -65015.
When starting with our initial theta1, the log-likelihood values are always increasing with every iteration. However, when starting with MEME's theta1 this is not true. During the t=10 stage, the first 6 iterations actually decrease the log-likelihood (approaching -65223). If I didn't see that our own theta1 + DA experiment had always increased log-likelihood, I'd expect this could happen because of the soft assignments. But now I'm very confused. Why would log-likelihood decrease when run using MEME's theta1?
Thanks, Matt
|
| Charles Elkan
|
114
|
 |
|
11-26-2007 02:05 AM ET (US)
|
|
Using the "letter-probability matrix" as the initial theta1 (along with all my other initial parameters) and no annealing, I converge to the same results as MEME in just 2 iterations. When I calculate the sum log likelihood after this convergence, I get a value of -65223, whereas my not-so-crisp results produce a sum log likelihood of -65015 consistently.
Yes, what you are doing is the best way of finding the log-likelihood of the MEME output. It seems your model has better log-like. than the MEME output. A plausible explanation is that MEME is tuned to find crisp motifs; maximizing likelihood is not the perfect objective from a biological perspective.
More alarmingly, when I start with MEME's "letter-probability matrix" as my initial theta1, but run EM with a 10,1 deterministic annealing T schedule, I get a Z_ij matrix that attributes absolutely all subsequences to the background model (i.e. all Z_i1 are 0 and Z_i2 are 1). This is even more confusing.
Yes, this is strange. Is the log-likelihood increasing at every iteration; Is the final log-likelihood better than without DA?
It is possible for EM-style methods to converge to a model where some components are unused, like you are finding with DA. In this case you have a bad local optimum and you need to jump out of it somehow.
In general, note that none of the following are guaranteed: DA will find the global optimum. MEME finds the global optimum. The global optimum is the optimum from the end-user perspective. The 10,1 DA schedue is effective for any particular task.
|
| Matt Jacobsen
|
113
|
 |
|
11-26-2007 01:26 AM ET (US)
|
|
I don't quite understand how to calculate the the likelihood from the MEME output. In my code, I calculate the log likelihood for each subsequence and sum it together. But this requires my final theta1, theta2, lambda1, lambda2, and Z_ij matrix. I don't get this from the MEME output. I only get (as far as I can see) the theta1 matrix, called the "letter-probability matrix". Am I missing something from the MEME output? Is there a better way of calculating the likelihood using the MEME output?
Using the "letter-probability matrix" as the initial theta1 (along with all my other initial parameters) and no annealing, I converge to the same results as MEME in just 2 iterations. When I calculate the sum log likelihood after this convergence, I get a value of -65223, whereas my not-so-crisp results produce a sum log likelihood of -65015 consistently. I would expect the log likelihood to be some kind of indicator of how well the Z_ij, theta, and lambda categorize the subsequence examples. So how can this be?
More alarmingly, when I start with MEME's "letter-probability matrix" as my initial theta1, but run EM with a 10,1 deterministic annealing T schedule, I get a Z_ij matrix that attributes absolutely all subsequences to the background model (i.e. all Z_i1 are 0 and Z_i2 are 1). This is even more confusing. Shouldn't the deterministic annealing approach find the global optimum (especially if it starts with a theta1 that's the result of finding the global optimum)?
Thanks, Matt
|
| Charles Elkan
|
112
|
 |
|
11-26-2007 12:04 AM ET (US)
|
|
if I get a single subsequence with Z_i1 > 0.5 for a particular protein, is it the case that it has at least 1 motif? Because it seems like I'm getting at least 1 subsequence instance that's attributed > 50% to the motif component for all proteins I've tested. So I'm not encountering any 0 motif examples when I believe MEME is for the same protein in the same test.
It sounds like you are finding a motif that "hits" very many subsequences. This is consistent with getting a motif model that is not crisp, i.e. does not have very high probabilities for any aa in any position, as described in /m107. So my guess is that this is another aspect of the problem of not finding a good local optimum.
|
| Charles Elkan
|
111
|
 |
|
11-25-2007 11:59 PM ET (US)
|
|
MEME converges to more extreme values than our model. Our model has less extreme values and the values stay constant over position. This happens for letters that have a higher background probability. Has anybody else seen this? We know we are running our model to convergence, but why does it not converge as sharply as MEME does?
My guess is that your code is converging to a bad local optimum. You could try starting with the MEME output. You should find (a) this has better likelihood and (b) your code then converges to something with even better likelihood.
MEME does have a successful domain-specific way of finding god local optimum. The idea is as follows:
(1) Use every subsequence as an initial guess; run EM for just a few iterations from it. (2) For just the top few subsequences from (1), run EM to convergence.
|
| Matt Jacobsen
|
110
|
 |
|
11-25-2007 09:44 PM ET (US)
|
|
I'm still a bit confused about how we should select the best subsequence representing a motif so as to compare with MEME's subsequence output. I believe it was posted earlier that we can find these subsequences by selecting the ones that have a higher probability of being from the motif component (i.e. Z_i1 > 0.5). But this selects many subsequences (first problem). I can look at all of them to see if one matches the one MEME selects as its "motif 1 site" but that seems rather unscientific.
I though I could select the subsequence with the largest Z_i1 probability but now that I'm using deterministic annealing, there are many now with 1.0 values (second problem).
This then begs the question again, if I get a single subsequence with Z_i1 > 0.5 for a particular protein, is it the case that it has at least 1 motif? Because it seems like I'm getting at least 1 subsequence instance that's attributed > 50% to the motif component for all proteins I've tested. So I'm not encountering any 0 motif examples when I believe MEME is for the same protein in the same test.
Thanks, Matt
|
| Stefano
|
109
|
 |
|
11-25-2007 09:35 PM ET (US)
|
|
Hi Matt,
Yes, I've noticed that specifically the theta1 matrix does not have as large (extreme) probabilities for the "most-likely" proteins for each position.
I thought it had something to do with my incorrect interpretation (and implementation) of the normalization of the Z_i1 vector. But, when I looked at the theta1 probability matrix, the corresponding motif was not trivial (well, at least not a single protein or 2 repeated proteins) using the Z_i1 for the M-step.
So, the largest probabilities in our theta1 are noticeably larger than the rest for a particular position in the subsequence, but not as large as the ones MEME creates, such as 0.93 or 0.96, etc...
Our likelihood improves each iteration as well, which is why I think its ok...
I'm going to venture a guess and say the initial starting values for the theta1 and theta2 are too far away from the desired ones and we've reached a local max. I've tried some different (simple) initialization methods, but so far it hasn't cause the theta values to be more extreme. This is for EM not the DA.
Sorry I couldn't answer your question, but it seems to be happening to us as well.
|
| Stefano
|
108
|
 |
|
11-25-2007 09:20 PM ET (US)
|
|
Are we supposed to normalize the Z_i1 vector for all subsequences of length W such that the sum over each subsequence is <= 1? This is described as the z_ij matrix (small z matrix) in the paper. Are we supposed to use this z_ij matrix (small z) during the M-step instead of the Z_ij (big Z) matrix for j=1,2 and i=1..n?
In the paper this is done to prevent the EM from converging to simple motifs, but this then breaks the property making each a probability, Z_i1 + Z_i2 = 1 (unless we're supposed to re-scale Z_i2 as well?) Or is there some other way to prevent convergence to simple motifs?
I had thought we did it correctly, until I re-read this part. Could someone clarify? Thanks!
|
| Matt Rodriguez
|
107
|
 |
|
11-25-2007 09:14 PM ET (US)
|
|
Hi everybody
My partner and I believe that we are running the EM correctly because the likelihood improves at each iteration. However we have noticed that our theta1 model parameter does not have as extreme values as the theta1 in MEME.
We've generated a simplified position probability matrix by multiplying each value by 10 and rounding it and we get something like this:
A 1:122:1::1::11111111 C :::::::::::::::::::: D :111:::::::::::::1:1 E 1:::::::::1:1::1:::: F 1:::::::11::1::::1:: G 111111212221111121:2 H :::::::::::::::::::: I 22111:1121:22:11:11: K :11:1:1::::1::1::::1 L :111:::1::::1::1::1: M 1::1:1:::::::::1::1: N ::1111:1:111:111:1:: P :::::1::1::1:11::1:1 Q :::::::::::::11::::: R 1111:1:1:1:::1:::::: S 1:::11:11:11:11:2:11 T :1:11:11:1::11::1:11 V :::1:121221111211211 W :::::::::::::::::::: Y ::::::::::1:1:11:11: X :::::::::::::::::::: background model A 1.096481e-001 C 6.018062e-003 D 4.623882e-002 E 6.088052e-002 F 2.949489e-002 G 9.469746e-002 H 1.283185e-002 I 5.853482e-002 K 5.132334e-002 L 9.456777e-002 M 2.871076e-002 N 4.549275e-002 P 3.586063e-002 Q 2.397957e-002 R 4.409827e-002 S 6.312976e-002 T 6.590536e-002 V 9.949338e-002 W 8.888773e-003 Y 2.020036e-002 X 4.714461e-006
Now when we run MEME the simplified position probability matrix is this Simplified A :::3::::::3::::::::: pos.-specific C :::::::::::::::::::: probability D ::3:::::::::3::::5:: matrix E :::::::::::::::::::: F :::::::::::::::::::: G 53::::::3:::53:::::3 H ::3:::::::5::::::::: I 3:::a::::3::::3::::: K ::33:::::::::::::::: L :::::::::::::::3:::: M :::::::::::::3::::a3 N :::3:::8::3::::3:::: P :3:::::::::a:::::5:: Q 3::::::::::::::::::: R :3:3:a:::::::::::::: S ::::::::8::::::::::: T :3::::::::::3:3:a::5 V ::::::a3:8:::55::::: W :::::::::::::::::::: Y ::3::::::::::::5::::
There are at least 2 things to note. MEME converges to more extreme values than our model. Our model has less extreme values and the values stay constant over position. This happens for letters that have a higher background probability. Has anybody else seen this? We know we are running our model to convergence, but why does it not converge as sharply as MEME does?
Thanks, Matt
|
| Charles Elkan
|
106
|
 |
|
11-24-2007 11:27 PM ET (US)
|
|
"For Zij(0), the part specifically that I don't get is the P(Xi|theta_j). Are we supposed to use equations (7) and (8) to calculate that?" Yes, this is the E-step.
"And if that is the case, then we estimate the f_ij's from the data, specifically on the subsequences of length W?" Yes, this is the M-step.
"Then lambda is set initially (I think it is discussed in the paper and earlier on this board), and we can calculate Z_ij(0)."
Yes. As part of each M-step you recompute lambda and all f_ij. Then you use these values in the next E-step. This next E-step gives you new Zij values. You use these in the next M-step, and so on.
|
| Jerry Fu
|
105
|
 |
|
11-24-2007 10:13 PM ET (US)
|
|
I'm a little confused on how we're supposed to calculate Zij(0) (equation (4)). From the paper, I didn't quite get which parameters we are supposed to get directly from the data.
For Zij(0), the part specifically that I don't get is the P(Xi|theta_j). Are we supposed to use equations (7) and (8) to calculate that? And if that is the case, then we estimate the f_ij's from the data, specifically on the subsequences of length W?
Then lambda is set initially (I think it is discussed in the paper and earlier on this board), and we can calculate Z_ij(0).
|
| Charles Elkan
|
104
|
 |
|
11-21-2007 02:10 PM ET (US)
|
|
I believe X means "unknown amino acid." There is a whole code of letters that denote various ambiguities, i.e. subsets of amino acids. The real-world MEME can deal with these codes. You can just eliminate the subsequences from this protein that contain the X.
Charles
On Wed, 21 Nov 2007, QT - Klinton Bicknell wrote:
> < replied-to message removed by QT >
|
| Klinton Bicknell
|
103
|
 |
|
11-21-2007 12:52 PM ET (US)
|
|
The protein MAS1_AGRRA has a single X in its sequence, a character not found in any of the other proteins, and the output for MEME does not include X in its vocabulary. Is this a mistake in our data that wasn't present in the data MEME got? Or is MEME just ignoring units of vocabulary that don't occur below a certain threshold number of times?
|
| Charles Elkan
|
102
|
 |
|
11-21-2007 01:06 AM ET (US)
|
|
"After running EM to convergence, we'll have probability distributions per letter for the background model and per letter, per position for the motif model. But how can we detect what the best motif is from these distributions?"
The best motif *is* the foreground distributions.
Here "best" means "part of the model with highest likelihood, as found by EM." For all the reasons mentioned in class, this "best" motif may not be exactly the one that biologists would consider best.
"How would we know that no motif is present?"
For each subsequence, z is the probability that it was generated by the foreground component. If z > 0.5 then you can say that this subsequence is an instance of the motif. Otherwise, say that it is an instance of the background, i.e. not of the motif.
|
| Charles Elkan
|
101
|
 |
|
11-21-2007 12:48 AM ET (US)
|
|
"Regarding the midterm: Will there be posted a solution example?" I don't have a sample solution ready currently. Feel free to ask here about any particular question you are unsure about.
|
| Matt Jacobsen
|
100
|
 |
|
11-20-2007 09:11 PM ET (US)
|
|
After running EM to convergence, we'll have probability distributions per letter for the background model and per letter, per position for the motif model. But how can we detect what the best motif is from these distributions? We're thinking that we can subtract the background probabilities from the motif probabilities on a per position basis. Then assume the maximum probability left per position represents the motif letter in that position. But this would produce a motif every time and of length W. So that doesn't seem correct. How would we know that no motif is present?
Thanks, Matt
|
| Jostein Oysad
|
99
|
 |
|
11-20-2007 08:45 PM ET (US)
|
|
Regarding the midterm: Will there be posted a solution example?
|
| Charles Elkan
|
98
|
 |
|
11-19-2007 09:05 PM ET (US)
|
|
> Last year's midterm was open book and calculator allowed. Does > the same rule apply for this year's midterm? Since we don't have > required text books, are we allow to bring our notes?
Yes to everything.
|
| Sean Park
|
97
|
 |
|
11-19-2007 02:08 PM ET (US)
|
|
Last year's midterm was open book and calculator allowed. Does the same rule apply for this year's midterm? Since we don't have required text books, are we allow to bring our notes?
Thanks,
Good luck everyone~
|
| Charles Elkan
|
96
|
 |
|
11-19-2007 12:16 PM ET (US)
|
|
"1) How is overfitting dealt with in unsupervised algorithms. We want to figure out parameters and whatnot to our algorithm. Do we need to do this on a subset of the data rather than all of the data to avoid overfitting? (validation set vs whole set)? Normally the validation set is chosen out of the final training set to avoid biasing the final results, but here we only have one set so how does this work?"
There is no good answer to this question. Commonly, one does smoothing during training so that zero-probability in the training set does not imply zero-probability on the test set. The strength of smoothing can be determined by measuring likelihood on a test/validation set. This set may be independent, or simulated via cross-validation.
> 2) Should we worry about the Beta value in equation 13? If so, what > should it be set to originally? In the implementation section it just > says that Beta is user-specified (based on information available about > the distributions of the motifs and background) and we cannot find any > further elaboration in the paper (what information about the > distributions?). If it is user-specified, what should we specify it as?
This beta_j is the strength of smoothing for letter number j. It is essentially the same as the pseudo-count for a naive Bayes classifier. So you can start with each beta_j = 0.1. Feel free to do experiments with different ways of setting beta_j (they do not have to be all equal).
"3) What about the Erasing factor described in footnote 1? Should we worry about this?"
You don't need to worry about this, because it is only relevant when finding two or more motifs. Your job is just to find one.
|
| Jostein Oysad
|
95
|
 |
|
11-18-2007 11:38 PM ET (US)
|
|
Couple of questions:
1) How is overfitting dealt with in unsupervised algorithms. We want to figure out parameters and whatnot to our algorithm. Do we need to do this on a subset of the data rather than all of the data to avoid overfitting? (validation set vs whole set)? Normally the validation set is chosen out of the final training set to avoid biasing the final results, but here we only have one set so how does this work?
2) Should we worry about the Beta value in equation 13? If so, what should it be set to originally? In the implementation section it just says that Beta is user-specified (based on information available about the distributions of the motifs and background) and we cannot find any further elaboration in the paper (what information about the distributions?). If it is user-specified, what should we specify it as?
3) What about the Erasing factor described in footnote 1? Should we worry about this?
Thanks for any help!
|
| Charles Elkan
|
94
|
 |
|
11-18-2007 08:55 PM ET (US)
|
|
"Is the theta parameter for a likelihood function that is fitting a multinomial is theta = (p1, p2, p3, ... pL), where there are L possible outcomes and pi is the probability of that outcome?"
Correct.
"To maximize theta I need to be the likelihood function will be something like this L(X;theta) = C p1^x1^p2^x2...pL^xL. C is a term in the multinomial distribution pmf that isn't important to maximize the likelihood because it doesn't depend on theta.
Correct.
What is the log likelihood function that is to be maximized? Is it SUM_i pi^xi? If so then how do I maximize that function with respect to theta?
Remember to take logs inside the product. The log likelihood is SUM_i x_i*(log p_i). You can maximize this using a Lagrange multiplier to ensure SUM_i p_i = 1.
|
| Matt Rodriguez
|
93
|
 |
|
11-18-2007 08:11 PM ET (US)
|
|
I've been thinking about likelihood functions and fitting multinomial distributions and I want to ensure that my thinking is correct.
1. Is the theta parameter for a likelihood function that is fitting a multinomial is theta = (p1, p2, p3, ... pL), where there are L possible outcomes and pi is the probability of that outcome?
2. To maximize theta I need to be the likelihood function will be something like this L(X;theta) = C p1^x1^p2^x2...pL^xL. C is a term in the multinomial distribution pmf that isn't important to maximize the likelihood because it doesn't depend on theta. What is the log likelihood function that is to be maximized? Is it SUM_i pi^xi? If so then how do I maximize that function with respect to theta?
Thanks
|
| Charles Elkan
|
92
|
 |
|
11-17-2007 08:13 PM ET (US)
|
|
"In the sample output, I understand how to get Simplified probability matrix but I don't really get the information content graph with bits on it."
"Bits of information" means entropy. The entropy of a discrete distribution is SUM_i p_i*log(p_i) where i ranges over the possible outcomes and p_i is the prob. of outcome i. Higher entropy means higher uncertainty means more bits of information. An ideal motif has low entropy in each column. However, real-world motifs usually have some columns that are not strongly conserved, i.e. do not have low entropy.
Note that entropy is maximized by a uniform distribution.
|
| Charles Elkan
|
91
|
 |
|
11-17-2007 08:09 PM ET (US)
|
|
"one way to initialize a deterministic annealing EM is to choose theta/lambda to maximize the amount of search that EM will do by making each initial component a perturbed version of a global model of the data. Can you elaborate on this? I'm not quite sure what that means. Does that mean assume 1 component (say only background or only motif) and assume some kind of distribution for that 1 component?"
Yes. Fit a multinomial to the entire dataset by ML. This will already be approximately the correct background model. Initialize the motif-component to have every column equal to this multinomial, but with random changes to the 20 probabilities.
|
| Sean Park
|
90
|
 |
|
11-17-2007 03:46 PM ET (US)
|
|
In the sample output, I understand how to get Simplified probability matrix but I don't really get the information content graph with bits on it. Can anyone explain what the bits are?
Thanks,
|
| Matt Jacobsen
|
89
|
 |
|
11-17-2007 02:45 PM ET (US)
|
|
My partner and I are trying to figure out a good way to produce initial parameters and are a bit stumped. In the Mixture Models notes, you suggest that one way to initialize a deterministic annealing EM is to choose theta/lambda to maximize the amount of search that EM will do by making each initial component a perturbed version of a global model of the data. Can you elaborate on this? I'm not quite sure what that means. Does that mean assume 1 component (say only background or only motif) and assume some kind of distribution for that 1 component?
Any clarification would help.
Thanks, Matt
|
| Charles Elkan
|
88
|
 |
|
11-16-2007 05:03 PM ET (US)
|
|
"As I read the Bailey & Elkan paper, it seems that the indicator function I(k,a) in equations 7 and 8 is really defined as 1 if a = k, 0 otherwise. Because the arguments to it in those equations are k and X_ij. Here k is a letter (from the summation) and so is X_ij (the letter in the jth position from sample X_i). It doesn't seem to make sense otherwise, right?"
Essentially you are right. I(k,X_ij) = 1 iff X_ij is the kth letter, i.e. letter number k. Earlier the paper says that the kth letter is a_k, which is not quite the same thing as saying that the kth letter is the integer k.
It probably would have been better to use the notation I(X_ij = a_k).
|
| Matt Jacobsen
|
87
|
 |
|
11-16-2007 04:52 PM ET (US)
|
|
As I read the Bailey & Elkan paper, it seems that the indicator function I(k,a) in equations 7 and 8 is really defined as 1 if a = k, 0 otherwise. Because the arguments to it in those equations are k and X_ij. Here k is a letter (from the summation) and so is X_ij (the letter in the jth position from sample X_i). It doesn't seem to make sense otherwise, right?
Thanks, Matt
|
| Charles Elkan
|
86
|
 |
|
11-16-2007 12:55 AM ET (US)
|
|
You can find last year's midterm here: http://www-cse.ucsd.edu/users/elkan/250B/oldmidterm.pdfLast year I covered some topics in a different order, so last year's students had more background for some of these questions, especially the first one and the true/false statements 3, 4, 9, 10.
|
| Kristen Jaskie
|
85
|
 |
|
11-15-2007 08:51 PM ET (US)
|
|
Midterm review? You asked us to remind you.
|
| Charles Elkan
|
84
|
 |
|
11-13-2007 02:19 AM ET (US)
|
|
You are right, there are only 33 protein sequences in the adh.s dataset. Just use these. I'm not sure where I got the number 132...
|
| Jostein Oysad
|
83
|
 |
|
11-12-2007 10:38 PM ET (US)
|
|
For project 3, the dataset appears to have only 33 protein sequences. This is also the number that the MEME output describes. In the writeup, you said that our experiments should use 132 proteins. Are we misreading the data or is that a typo? Thanks!
|
| Charles Elkan
|
82
|
 |
|
11-10-2007 01:15 AM ET (US)
|
|
Yes, the deadline for Assignment 3 is Tuesday November 27 in class. Sorry for my mistake on the handout.
The midterm is on November 20, remember.
"The assignment says that it is due Nov 20, which I think is the Tuesday before Thanksgiving ( and the same day as the midterm!)...should that be Nov 27?"
|
| Jerry Fu
|
81
|
 |
|
11-09-2007 02:36 AM ET (US)
|
|
The assignment says that it is due Nov 20, which I think is the Tuesday before Thanksgiving ( and the same day as the midterm!)...should that be Nov 27?
|
| Sean Park
|
80
|
 |
|
11-08-2007 09:34 PM ET (US)
|
|
Hey guys, I'm looking for a enthusiastic person as a partner who's willing to start early and work hard. I'll work as hard, too. please email me at orgpark@gmail.com
Thank you,
|
| Charles Elkan
|
79
|
 |
|
11-07-2007 06:52 PM ET (US)
|
|
The new 250B project description is available here: http://www.cs.ucsd.edu/users/elkan/250B/project3.htmlThe deadline is the Tuesday after Thanksgiving, but please do start work as soon as possible. You will see that the project requires you to absorb a fair amount of background knowledge about protein sequence analysis. Do make this process efficient by asking clarification questions here.
|
| Charles Elkan
|
78
|
 |
|
11-05-2007 08:19 PM ET (US)
|
|
"Are we supposed to report on the performance on the 95K example validation set provided on the web site? Or just on the test set(s) carved out of the 95K example training set we downloaded?" Either or both, your choice.
"I'm remembering that you said not to download the true values of the validation set, is that still the case?"
You can download the true labels of the 95K official test set now, if you like, but you can only run *one* experiment using these.
|
| Stefano
|
77
|
 |
|
11-05-2007 08:05 PM ET (US)
|
|
Are we supposed to report on the performance on the 95K example validation set provided on the web site? Or just on the test set(s) carved out of the 95K example training set we downloaded?
I'm remembering that you said not to download the true values of the validation set, is that still the case?
|
| Charles Elkan
|
76
|
 |
|
11-05-2007 07:07 PM ET (US)
|
|
"So to obtain calibrated probability, do we actually use the true label of test set?"
No, you use the true labels from a validation set in order to learn the function that maps scores into calibrated probabilities.
"Then using the calibrated prob, we calculate the profit. Am I interpreting it correcly?"
You use the calibrated predictions on the test set to make optimal decisions for the test set. Then, you use the true labels from the test set to measure the actual profit you would have made. You do not use the true labels of the test set for training in any way.
|
| Sean Park
|
75
|
 |
|
11-05-2007 06:41 PM ET (US)
|
|
So to obtain calibrated probability, do we actually use the true label of test set? Then using the calibrated prob, we calculate the profit. Am I interpreting it correcly?
|
| Charles Elkan
|
74
|
 |
|
11-05-2007 11:59 AM ET (US)
|
|
"When you compute each score, you have to include its denominator. Then, these scores are the single x variable in a univariate regression. The y variable is the true label y=0 or y=1."
"is it okay to perform regular linear regression using x and y?"
You can try standard linear regression as a first cut, but it is really not appropriate because you want to guarantee that the predictions yhat = f(x) are in the range 0 to 1. Logistic regression guarantees this, but not linear regression.
|
| Sean Park
|
73
|
 |
|
11-05-2007 11:29 AM ET (US)
|
|
"No... When you compute each score, you have to include its denominator. Then, these scores are the single x variable in a univariate regression. The y variable is the true label y=0 or y=1."
If I understood correctly, is it okay to perform regular linear regression using x and y? and is it sort of isotonic regression?
|
| Sean Park
|
72
|
 |
|
11-05-2007 10:54 AM ET (US)
|
|
Deleted by author 11-05-2007 11:28 AM
|
| Charles Elkan
|
71
|
 |
|
11-05-2007 06:59 AM ET (US)
|
|
You wrote ""scores" are the numerators we calculate for each examples, and in our case, we want to use the value of N(C=donor) over all training examples as the y-variable in a regression, correct?"
No... When you compute each score, you have to include its denominator. Then, these scores are the single x variable in a univariate regression. The y variable is the true label y=0 or y=1.
"So then do we scale the scores so that they span from 0 to 1" After you include the denominator, this will be guaranteed.
|
| Jerry Fu
|
70
|
 |
|
11-04-2007 10:32 PM ET (US)
|
|
Hi
We're a bit confused about turning the scores generated by the Bayesian classifier into well-calibrated probabilities. We have an idea, but we're not sure if it's correct. "scores" are the numerators we calculate for each examples, and in our case, we want to use the value of N(C=donor) over all training examples as the y-variable in a regression, correct?
So then do we scale the scores so that they span from 0 to 1, and do a regression with the feature values as our variables? The only problem is that I believe what I just described is multiple regression, and the project description says to use isotonic or logistic univariate regression. Thanks for any help on this!
|
| Charles Elkan
|
69
|
 |
|
11-04-2007 04:36 PM ET (US)
|
|
"I'm a bit confused on how to calculate confidence intervals for the total number of donors (predicted) and total amount donated (predicted)." Obtaining a CI for the total amount is harder, so if you obtain a CI for the number of donors, that will be enough.
What you want is P(count in CI) = 0.95. You can assume that the CI is approximately [m - 2s, m+2s] where m = E[count] and s^2 = Var[count]. Expectations and variances are additive, so E[count] = SUM_i p_i where i ranges over all test examples and p_i is the probability that test example i is a donor.
The variance of the count is Var[count] = SUM_i p_i*(1-p_i) since the variance of a single coin flip with probability p_i is p_i*(1-p_i).
|
| Charles Elkan
|
68
|
 |
|
11-04-2007 04:30 PM ET (US)
|
|
"the results of cross validation seems to be too low, aroun 10% precision, which is only improved twice after the algorithm. It is too low, right?"
Not necessarily. I think it is very hard to get precision over 20%, even just on the top 1%.
"Strange thing is that when I apply different features, the outcome does not change much."
This is not surprising. None of the features are individually very predictive. Most features have no predictive value at all.
|
| Sean Park
|
67
|
 |
|
11-04-2007 02:47 PM ET (US)
|
|
Edited by author 11-04-2007 02:49 PM
After discretizing features, I applied Naive Bayes on selected features and assumed top 10% probabilities are donors. However, the results of cross validation seem to be too low, around 10% precision, which is only improved twice after the algorithm. What is the range of precision expected approximately? Strange thing is that when I apply different features, the outcome does not change much. Thanks,
|
| Matt Jacobsen
|
66
|
 |
|
11-04-2007 02:07 PM ET (US)
|
|
I'm a bit confused on how to calculate confidence intervals for the total number of donors (predicted) and total amount donated (predicted). I know how to calculate the predictions, just not how to prove that they're bound by an interval with 95% confidence. I've been reading All of Statistics, and I believe understand the concept: P(p in C_n) >= 1-alpha Here, I'd use alpha=0.05 and p is my prediction. But I'm unsure what C_n is for my distribution. It seems like I have a binomial distribution. So perhaps I can use: C_n=(Xh - epsilon, Xh + epsilon). But I'm unsure how to calculate epsilon using alpha (to ensure 95%) nor what Xh would be in this case. Any suggestions would be helpful.
Thanks, Matt
|
| Charles Elkan
|
65
|
 |
|
11-02-2007 02:07 PM ET (US)
|
|
"I'm thinking about how to provide a numerical estimate for profit that we expect to receive. To answer this question we need to know how many donors we will solicit or have the choice to solicit. How many donors should we assume that we will have the choice to solicit?"
You choose how many donors to solicit. The optinal choice for person x is to solicit iff p(x)*e(x) > 0.68 where p(x) is the estimated calibrated probability that x will donate, and e(x) is the estimated amount s/he will donate, assuming s/he does donate.
"It is possible to have a probability distribution of profit for a single donor."
You don't need a whole distribution. You need just a single best-guess dollar number. You get this from linear regression.
"I don't know how to create a probability distribution for say the number of donors that we will solicit."
You don't need a distribution for this, and not even an estimate. This number is a consequence of making a separate optimal decision for each person.
|
| Matt Rodriguez
|
64
|
 |
|
11-02-2007 01:52 PM ET (US)
|
|
I'm thinking about how to provide a numerical estimate for profit that we expect to receive. To answer this question we need to know how many donors we will solicit or have the choice to solicit. How many donors should we assume that we will have the choice to solicit?
It is possible to have a probability distribution of profit for a single donor but I don't know how to create a probability distribution for say the number of donors that we will solicit. I think this requires sampling the single donor profit probability distribution. What is a good way to construct the probability distribution of the amount that a group of donors will donate?
|
| Charles Elkan
|
63
|
 |
|
11-02-2007 10:38 AM ET (US)
|
|
"Do we have to use all the patterns in the training data set for training? On the hw description page, it says the algorithm implementation must handle up to 100,000 patterns and 100 features, can we use less than that many patterns and features?"
To get the best results, you need to use all training examples, but not all features of each example; see earlier messages on this board. Feel free to use fewer examples while developing your own code. The 100,000 by 100 guideline is to indicate that your code should be efficient.
|
| Charles Elkan
|
62
|
 |
|
11-02-2007 10:36 AM ET (US)
|
|
"I have a question regarding multilinear regression. Should I consider that it tries to fit a line to a set of data points or a plane/hyperplane to a set of datapoints?" What you need for this project is multivariate linear regression: a single scalar y value for each input x value. You can use builtin Matlab functions to do this: http://www.mathworks.com/access/helpdesk/h...1-8450.html#f1-7351 "Does anyone have a comprehensive statistics book that they would recommend?" "All of Statistics" by Wasserman is often recommended.
|
| Stefano
|
61
|
 |
|
11-02-2007 12:35 AM ET (US)
|
|
Do we have to use all the patterns in the training data set for training? On the hw description page, it says the algorithm implementation must handle up to 100,000 patterns and 100 features, can we use less than that many patterns and features?
|
| Matt Rodriguez
|
60
|
 |
|
11-01-2007 08:55 PM ET (US)
|
|
Hi I have a question regarding multilinear regression.
Should I consider that it tries to fit a line to a set of data points or a plane/hyperplane to a set of datapoints?
I'm getting a little tired of learning statistic concepts off of wikipedia so does anyone have a comprehensive statistics book that they would recommend?
Thanks
|
| Charles Elkan
|
59
|
 |
|
11-01-2007 11:12 AM ET (US)
|
|
"However, how is precision calculated for that top 10% "most likely to be donors" group? They all have a predicted value over 5.08%. I know which ones are actual donors. But since the predicted values all less than 50%, I wouldn't actually predict any as donors. Which means I still don't have any predicted positives, right? Which means I cannot calculate precision by: precision = tp/(fp + tp)."
You *do* predict that the top 10% are donors, even though their probability is under 0.5. Therefore, fp+tp = 10% and tp is however many actual donors there are among the top 10%.
Remember that if the donation amount is $20 and the solicitation cost is 68 cents, then it is rational to do the solicitation action as long as 2000p > 68, where p is the probability. Predicting that the top 10% are donors means taking the action for them. This rule-of-thumb is an approximation to making the correct decision, i.e. take the action if and only if p > 68/2000 = 0.034.
In other words, because of the imbalance between costs and benefits, the optimal threshold for decision-making is 0.034, not 0.50.
|
| Matt Jacobsen
|
58
|
 |
|
10-31-2007 01:24 PM ET (US)
|
|
I see. Comparing the predicted values to the base rate will give a measurement of how much better the classifier is as compared to no classifier. That makes sense. However, how is precision calculated for that top 10% "most likely to be donors" group? They all have a predicted value over 5.08%. I know which ones are actual donors. But since the predicted values all less than 50%, I wouldn't actually predict any as donors. Which means I still don't have any predicted positives, right? Which means I cannot calculate precision by: precision = tp/(fp + tp).
Thanks, Matt
|
| Charles Elkan
|
57
|
 |
|
10-31-2007 12:18 AM ET (US)
|
|
"because the class cardinalities are so unbalanced, my classifier doesn't classify anything as a donor. P(Y=donor) = 5.08% and P(Y=non-donor) = 94.92%. This makes all test example have a larger non-donor numerator because in my 1 feature experiments it's always P(Y=donor)*x vs. P(Y=non-donor)*y. For this to ever yield a larger donor numerator (and thus label the example as a donor), x must be ~19 times larger than y. I can't imagine that I'd find any feature that has this kind of distribution. This can't be correct can it?"
Yes, the above is correct. Predicting the class with higher numerator is equivalent to predicting the class that has (uncalibrated) probability over 50%. Because the base-rate for non-donors is so high (94.92%) no combination of feature values can make the predicted probability of this class go below 50%.
What you can do is (for example) select the tenth of test examples that have highest predicted probability of being donors. These predictions will all be above 5.08% and below 50%. Then, compute the precision on this tenth. Hopefully, the precision will be well over 5.08%, but still under 50%.
|
| Matt Jacobsen
|
56
|
 |
|
10-30-2007 09:09 PM ET (US)
|
|
Edited by author 10-30-2007 09:12 PM
In an effort to verify that my Naive Bayes implementation is correct, I reduced the feature set to 1 feature. I was hoping that I could see some results from just using 1 feature to ascertain whether things were working correctly. I first tried using RFA_2A, then RFA_2F as the 1 feature. (Unfortunately, both of those features give fairly even distributions across both classes, donors/non-donors.) What I uncovered is that because the class cardinalities are so unbalanced, my classifier doesn't classify anything as a donor. P(Y=donor) = 5.08% and P(Y=non-donor) = 94.92%. This makes all test example have a larger non-donor numerator because in my 1 feature experiments it's always P(Y=donor)*P(X=x|Y=donor) vs. P(Y=non-donor)*P(X=x|Y=non-donor). For this to ever yield a larger donor numerator (and thus label the example as a donor), P(X=x|Y=donor) must be ~19 times larger than P(X=x|Y=non-donor). I can't imagine that I'd find any feature that has this kind of distribution. This can't be correct can it? Am I missing something here?
Thanks, Matt
|
| Charles Elkan
|
55
|
 |
|
10-29-2007 12:28 AM ET (US)
|
|
Binning is an acceptable method for calibration. I will be talking about isotonic and logistic regression in class. Don't worry if you have to move ahead on the project before then; I will be flexible.
|
| Matt Jacobsen
|
54
|
 |
|
10-29-2007 12:18 AM ET (US)
|
|
I'm a bit confused as to what exactly the postprocessing algorithm can be (to convert predicted scores into well-calibrated probabilities). The project page indicates it should be either isotonic regression or univariate logistic regression. What would be some examples of these kinds of methods? Is there a good place to find some more information? Wikipedia wasn't very helpful to me. I've implemented naive Bayes and was planning on using binning to achieve well calibrated probabilities, but I'm not sure if this qualifies as one of the two regression techniques.
Thanks, Matt
|
| Charles Elkan
|
53
|
 |
|
10-26-2007 10:41 AM ET (US)
|
|
Given the requests and the circumstances, there will be a general extension for the current project by one week, until Tuesday November 6 in class.
I will announce a revised schedule for the remaining projects.
The exam schedule is unchanged: the midterm will be in class on Tuesday November 20 and the final exam will be on Thursday December 13.
|
| Jerry Fu
|
52
|
 |
|
10-26-2007 07:46 AM ET (US)
|
|
I think a general extension would be very help[ful. I have not had to evacuate this week, but I've fallen a bit behind on work just due to getting caught up in the events of the week. Also, with campus closed, I haven't had a chance to meet with my partner at all.
|
| Sean Park
|
51
|
 |
|
10-25-2007 02:12 PM ET (US)
|
|
I think general extension will help us alot. My wife and I have evacuated due to the air quality which might be harmful to her since she's pregnant. I guess many of us have different difficulties during this disaster.
Thanks,
|
| Charles Elkan
|
50
|
 |
|
10-25-2007 01:33 PM ET (US)
|
|
Here is an official campus announcement (extract):
* The UC Office of the President has approved a reduction in days of instruction for this quarter, given the nature of the emergency. Therefore the calendar for the remainder of Fall quarter and for final exams will not be changed.
* Faculty should adjust their syllabus and content for the loss of a week of classes, and should not expect to schedule additional class meetings to make up for that time.
|
| Charles Elkan
|
49
|
 |
|
10-25-2007 01:15 PM ET (US)
|
|
"Is there a plan to make up the 2 lost lectures from this week?"
So far there is no plan. It is conceivable the campus will add on a week of classes in December. Let me stress this is not a predictin and I have no inside knowledge.
"Also, will the due date of project 2 be impacted by this week's cancellations?"
I will be flexible if I get any individual requests. Please post here if you have an opinion or preference about a general extension, or not? I hope you and your families are all safe and sound.
|
| Matt Jacobsen
|
48
|
 |
|
10-24-2007 04:37 PM ET (US)
|
|
Is there a plan to make up the 2 lost lectures from this week? Also, will the due date of project 2 be impacted by this week's cancellations?
Thanks, Matt
|
| Charles Elkan
|
47
|
 |
|
10-23-2007 02:24 PM ET (US)
|
|
Today's lecture is canceled, unfortunately.
I hope all of you and your families are safe and well.
|
| Charles Elkan
|
46
|
 |
|
10-20-2007 12:35 PM ET (US)
|
|
The minimum set of useful features is probably R, F, A (three) and a few measures of previous donation amounts (min, max, mean).
Look at the documentation for the meaning of RFA. Using these was the state-of-the-art in database marketing before the use of trained classifiers.
On Sat, 20 Oct 2007, QT - Sean Park wrote:
> < replied-to message removed by QT >
|
| Charles Elkan
|
45
|
 |
|
10-20-2007 12:32 PM ET (US)
|
|
|
| Sean Park
|
44
|
 |
|
10-20-2007 02:48 AM ET (US)
|
|
How many, approximately, features should we have after ruthlessly reducing features? Is there somewhat minimum number of features?
|
| Sean Park
|
43
|
 |
|
10-20-2007 01:51 AM ET (US)
|
|
I cannot access the linked report, Experimental design for solicitation campaigns by Mayer and Sarkissian. The link tells me to buy it for $10. Prof. Elkan, can you please tell me other way to access it?
|
| Mehran Bozorgi
|
42
|
 |
|
10-17-2007 07:56 PM ET (US)
|
|
My partner decided to drop the course. Is anyone looking for a partner for project 2? if so, send me an email at mbozorgi@cs.ucsd.edu
|
| Charles Elkan
|
41
|
 |
|
10-09-2007 01:51 AM ET (US)
|
|
> We have built 45 pariwise classifiers, and when averaging all of > the accuracies of each, we end up with a value of .9793, or an > error rate of .0207 which is lower than results getting > published in NIPS. This cant be right.
See some previous messages. 98% average accuracy is believable for pairwise classification, but not for ten-way classification. I would guess that the published results you have seen are for making ten-way decisions.
|
| Charles Elkan
|
40
|
 |
|
10-09-2007 01:48 AM ET (US)
|
|
> What is the correct understanding of parametric and > non-parametric in this case?
A parametric classifier has a fixed number of parameters whose values can be learned, regardless of the number of training examples. A nonparametric classifier generally can increase in complexity as the size of the training set increases.
|
| Gideon Prior
|
39
|
 |
|
10-09-2007 12:53 AM ET (US)
|
|
We have built 45 pariwise classifiers, and when averaging all of the accuracies of each, we end up with a value of .9793, or an error rate of .0207 which is lower than results getting published in NIPS. This cant be right.
|
| Justin Oysad
|
38
|
 |
|
10-08-2007 11:55 PM ET (US)
|
|
What is the correct understanding of parametric and non-parametric in this case?
|
| Charles Elkan
|
37
|
 |
|
10-08-2007 11:17 PM ET (US)
|
|
> You said in the class notes that "the (perception) algorithm > works well in practice on high-dimensional separable as well as > nonseparable data. For example, with a kernel function 28 by 28 > images of digits are separable." > > First of all, what is a kernel function? I'll explain this in class soon.
> And how can the digits be separable? It seems like since even > humans have trouble distinguishing between ones and sevens at > times that the data MUST overlap some amount.
Because the training set is finite, it has to be separable in some way. This is another way of saying that zero error is attainable on the training set by some classifier (for example, 1NN). However, a classifier that achieves zero training error is probably overfitting the training set.
|
| Kristen Jaskie
|
36
|
 |
|
10-08-2007 11:05 PM ET (US)
|
|
Edited by author 10-08-2007 11:07 PM
You said in the class notes that "the (perception) algorithm works well in practice on high-dimensional separable as well as nonseparable data. For example, with a kernel function 28 by 28 images of digits are separable."
First of all, what is a kernel function? And how can the digits be separable? It seems like since even humans have trouble distinguishing between ones and sevens at times, that the data MUST overlap some amount. For this reason I didn't even try running the standard perceptron on it, assuming it might fall into an infinite loop. Instead, the Voted Perceptron seemed like a better bet. Is there something I'm missing here?
Thanks!
|
| Charles Elkan
|
35
|
 |
|
10-08-2007 02:07 AM ET (US)
|
|
> if I run the kNN algorithm on the data and test samples with k = 1, 3, > and 5 say, in order to see what kind of effect this might have on the > results, that I can't use the results of the "best k" to calculate final > accuracy? Yes.
> Should I use a smaller subset of the data to > experiment to determine the optimum k and then calculate final > accuracy by using that k on the remaining datasets? Yes.
> Or should I just research the problem online to determine what previous > people have concluded about k and use and document that? The same > question goes for the number of Epochs in the Voted Perceptron.
You could look into previous work on selecting k (or the number of epochs) but I don't know of anything very useful. Simply searching for which k is best on a valdation set is a reasonable strategy.
|
| Kristen Jaskie
|
34
|
 |
|
10-08-2007 12:51 AM ET (US)
|
|
You said in one previous message that "Final accuracy should be measured on test data that you did not use for your preliminary experiments!" Does that mean that if I run the kNN algorithm on the data and test samples with k = 1, 3, and 5 say, in order to see what kind of effect this might have on the results, that I can't use the results of the "best k" to calculate final accuracy? Should I use a smaller subset of the data to experiment to determine the optimum k and then calculate final accuracy by using that k on the remaining datasets? Or should I just research the problem online to determine what previous people have concluded about k and use and document that? The same question goes for the number of Epochs in the Voted Perceptron. You sort of talked about this in /m27 but I want to make sure that I understand. Thanks!
|
| Charles Elkan
|
33
|
 |
|
10-07-2007 11:04 AM ET (US)
|
|
> Is overfitting the same concept as "data snooping?"
Data snooping is more specific: it means looking at the test data more than once. Overfitting is the dabgerous possible result of data snooping. Overfitting can also be the result of other causes, for example having too many features relative to the number of training examples.
|
| Sean Park
|
32
|
 |
|
10-06-2007 09:46 PM ET (US)
|
|
Edited by author 10-06-2007 09:49 PM
So if I understood correctly, does overfitting mean almost the same as "data snooping?"
|
| Charles Elkan
|
31
|
 |
|
10-06-2007 08:40 PM ET (US)
|
|
> What exactly do you mean by warning us not to overfit the test > sets? I understand the basic concept in mathematics, but am > unsure as to how it relates to this project. We use the > training data to create a classifier for each algorithm as > discussed in class, and then once we have a classifier, we run > the test data through it and calculate for each test sample if > it was calculated correctly or incorrectly. Where does > overfitting fit in?
You run the risk of overfitting the test set if you use the same test set repeatedly. For example, suppose you use the test set to select the best number T of epochs. Randomly, some T values will do better than others on the test set. The accuracy you see with the best T is likely to be higher than the accuracy that the same T will give on a different test set. A test set that you ise repeatedly to choose a value for a parameter of an algorithm is called a validation set. If you use a validation set, you should have a separate genuine test set. A test set is genuine if you only ever use it once. Of course when you have a limited quantity of labeled data, it is very tempting to use it all for training in one way or another.
|
| Kristen Jaskie
|
30
|
 |
|
10-06-2007 06:36 PM ET (US)
|
|
What exactly do you mean by warning us not to overfit the test sets? I understand the basic concept in mathematics, but am unsure as to how it relates to this project. We use the training data to create a classifier for each algorithm as discussed in class, and then once we have a classifier, we run the test data through it and calculate for each test sample if it was calculated correctly or incorrectly. Where does overfitting fit in? Does that mean that we shouldn't use all 28x28 pixels as individual features? Thanks!
|
| Charles Elkan
|
29
|
 |
|
10-06-2007 12:03 PM ET (US)
|
|
> But for multiclassifier part, I can't think of a suitable algorithm. I > tried comparing magnitude of dot product of each w and testset. I made > each w unit vector so that w's length wouldn't affect the magnitude of > dot product. However, I don't really see much correlation with its > result (too low accuracy). Any suggestions? Your method is a good idea but sometimes good ideas don't work :-) Another idea is to pick the digit that wins the greatest number (maximum 9) of the 45 pairwise contests. You may need a tie-breaking idea. > Another question, since we have many training points and features, > applying NN will be extremely slow although it seems to be working very > accurately. Prof. Elkan mentioned about kd-tree data structure in class. > Is kd-tree a reasonable approach for speeding up? Is there any > alternative ways? No tree data structure will work at all in 784 dimensions. The only method I know that may be useful is called LAESA. The algorithm is very simple: See Figure 3 of "A modification of the LAESA algorithm for approximated k-NN classification" by Francisco Moreno-Seco, Luisa Mico, Jose Oncina, available at http://citeseer.ist.psu.edu/574425.html
|
| Sean Park
|
28
|
 |
|
10-06-2007 03:21 AM ET (US)
|
|
I understand that accuracy of 45 classifiers are representing how well the perceptron algorithm works for each pair of binary decisions. But for multiclassifier part, I can't think of a suitable algorithm. I tried comparing magnitude of dot product of each w and testset. I made each w unit vector so that w's length wouldn't affect the magnitude of dot product. However, I don't really see much correlation with its result (too low accuracy). Any suggestions?
Another question, since we have many training points and features, applying NN will be extremely slow although it seems to be working very accurately. Prof. Elkan mentioned about kd-tree data structure in class. Is kd-tree a reasonable approach for speeding up? Is there any alternative ways?
|
| Charles Elkan
|
27
|
 |
|
10-05-2007 05:11 PM ET (US)
|
|
> In perceptron part, how do we calculate the "final" accuracy? > What I did was that I built 45 classifiers (there are 45 pairs > in 10 digits) and obtained 45 accuracies. Should I average them?
Every learning algorithm has various parameters that the user needs to set, for example k for NN or the number of epochs for the voted perceptron. "Final" refers to the classifier you train using your preferred settings, as opposed to the classifiers you train in preliminary experiments.
Final accuracy should be measured on test data that you did not use for your preliminary experiments!
If you have 45 classifiers you can average their accuracies and get an estimate of how well the learning method does for making binary decisions. You can also define an algorithm to combine the outputs of the 45 classifiers, and then measure the accuracy of the combination at making 10-way decisions. This latter accuracy will presumably be much lower.
> I think my approach doesn't mean much since I have to know the > labels of test examples in advance. This is true for making a binary decision.
> ... classify ...each test example without pre-knowledge of > the label of the test example. This is what making a ten-way decision means.
|
| Charles Elkan
|
26
|
 |
|
10-05-2007 05:00 PM ET (US)
|
|
> My error rate on the test1+test7 examples, using a classifier trained on > train1+train7 over 2 epochs, is 0.0055 (or so). That's 99.45% accuracy.
I think this is reasonable. Making a binary choice between two digits is much easier than makinga ten-way choice, so I would expect you to get an error rate lower than published numbers.
|
| Matt Jacobsen
|
25
|
 |
|
10-05-2007 01:15 PM ET (US)
|
|
What are people getting for their voted perceptron accuracy on the 1 vs. 7 tests? I'm a bit concerned that something is amiss with my implementation. My error rate on the test1+test7 examples, using a classifier trained on train1+train7 over 2 epochs, is 0.0055 (or so). That's 99.45% accuracy. Since most of the posted error rates for various classifiers on the MNIST page boast error rates considerably higher than that, I feel like something might not be right. Can anyone confirm that this is the right ballpark?
Thanks, Matt
|
| Sean Park
|
24
|
 |
|
10-05-2007 12:39 PM ET (US)
|
|
Edited by author 10-05-2007 12:41 PM
In perceptron part, how do we calculate the "final" accuracy? What I did was that I built 45 classifiers (there are 45 pairs in 10 digits) and obtained 45 accuracies. Should I average them?
I don't think my approach mean much because I have to know the labels of test examples in advance. I'm wondering if I have to come up with some algorithm that finds correct hyperplane and classify according to it for each test example without pre-knowledge of the label of the test example. So that it classifies correctly when only one test example is entered. Is this what I supposed to? If so, can someone give me some hint so that I can go on? I can't think of a good way to do it at this moment.
If my explanation isn't clear, let me know I'll try to paraphrase it.
Thanks,
|
| Charles Elkan
|
23
|
 |
|
10-05-2007 12:28 PM ET (US)
|
|
> What does it mean by "non-automated way" here?
Non-automated means you don't have to invent (or implement) an algorithm. You just have to make a justified, sensible, documented choice.
|
| Charles Elkan
|
22
|
 |
|
10-05-2007 12:25 PM ET (US)
|
|
> I think the task at hand is to come up with an algorithm that > processes the training data and derives a k. This seems like > this would be an automated approach
No, you don't need to invent an algorithm for choosing k. You just need to use your human intelligence to select a reasonable value of k for this particular task. Do some preliminary experiments and choose a value for k based on these. In your report, explain concisely how you made your choice.
|
| Matt Rodriguez
|
21
|
 |
|
10-04-2007 10:25 PM ET (US)
|
|
After thinking about this some more, it probably means that we should pick a k value.
|
| Sean Park
|
20
|
 |
|
10-04-2007 08:42 PM ET (US)
|
|
What does it mean by "non-automated way" here?
|
| Stefano
|
19
|
 |
|
10-04-2007 07:24 PM ET (US)
|
|
Is there someone still looking for a partner for hw1? If so, send me an email at sbonisso@cs.ucsd.edu
-Stefano
|
| Matt Rodriguez
|
18
|
 |
|
10-04-2007 04:24 PM ET (US)
|
|
Hi everybody,
I am thinking about implementing the KNN algorithm and it seems like the hardest part of the task is to come up with a good way to pick k. I am looking over the project description and it says that we should come up with a "sensible, well-documented, non-automated way". I don't understand why it would be non-automated.
I think the task at hand is to come up with an algorithm that processes the training data and derives a k. This seems like this would be an automated approach once the algorithm has been coded and debugged. Sorry for nit-picking here, but I am little confused about this.
Cheers, Matt
|
| Cara Buck
|
17
|
 |
|
10-04-2007 02:51 PM ET (US)
|
|
Hi Jerry, I'm also looking for a partner for project 1.
|
| Charles Elkan
|
16
|
 |
|
10-04-2007 01:15 PM ET (US)
|
|
> My approach was to take each row of say the 0 training matrix as a vector Yes.
> I need to do it on multiple training sets (such as the > 1 and 7 training sets)
Yes. The training set must consists of some positive examples (e.g. the "1" examples) and some negative examples (e.g. the "7" examples). You assemble the training set based on what concept you want to learn. A concept is a way of distinguishing between two classes.
|
| Jerry Fu
|
15
|
 |
|
10-04-2007 12:23 PM ET (US)
|
|
I am having difficulty figuring out how to use perceptron on the handwritten digits. My approach was to take each row of say the 0 training matrix as a vector, and then combining then seeing if it "agrees" with the existing w. However, with this approach, if I run it on just one training set, it seems that after the first few inputs, the perceptron would never update since they are all positive examples. Is this the correct approach, but just that I need to do it on multiple training sets (such as the 1 and 7 training sets), or am I taking the wrong approach here?
|
| Charles Elkan
|
14
|
 |
|
10-03-2007 11:47 PM ET (US)
|
|
> Is it possible to run the "Online" Perception algorithm on the > training data? Just feed in one after another of the 6k > training sets and calculate w that way? Or would that bias the > model somehow since the order would no longer be random?
The online algorithm assumes an indefinitely large training set. If you have a finite training set, the obvious way to make it larger is to cycle through it repeatedly. Each once-through is called an epoch.
The convergence theorem is true regardless of the order of examples. However, a non-random order can cause problems in practice.
The perceptron algorithm is strictly for two-class problems. You can't use it "as is" on examples from three or more classes.
(If I didn't answer your question, please restate it!)
|
| Kristen Jaskie
|
13
|
 |
|
10-03-2007 09:36 PM ET (US)
|
|
Is it possible to run the "Online" Perception algorithm on the training data? Just feed in one after another of the 6k training sets and calculate w that way? Or would that bias the model somehow since the order would no longer be random?
|
| Charles Elkan
|
12
|
 |
|
10-03-2007 03:31 PM ET (US)
|
|
> What kind of results are people getting with the perceptron? I > am jumping up to 95% or more in the first four epochs (I update > w every training example but only run the test data once every > epoch). Then my accuracy falls off and oscillates around 75-80% > if I let it go too long. Is this from over training? Is anyone > getting a perfect classifier?
I get over 95% accuracy but not 100%. The fall-off to 75-80% sounds like a bug or at least a "gotcha". Are you processing the data in random order? If not, this could be an issue.
|
| Charles Elkan
|
11
|
 |
|
10-03-2007 03:29 PM ET (US)
|
|
> PROJECT 1: Is there an easy way to convert the matlab version > of the data into a format readable in R?
I did a quick Google search, but I didn't find anything useful. Most likely you can export the data from Matlab in ascii and then import the ascii version into R.
However, I suggest you use the opportunity to learn Matlab. Both R and Matlab are high-quality software packages, with complementary advantages. Knowing both is beneficial.
|
| Gideon Prior
|
10
|
 |
|
10-03-2007 02:37 PM ET (US)
|
|
What kind of results are people getting with the perceptron? I am jumping up to 95% or more in the first four epochs (I update w every training example but only run the test data once every epoch). Then my accuracy falls off and oscillates around 75-80% if I let it go too long. Is this from over training? Is anyone getting a perfect classifier?
|
| hannah rohde
|
9
|
 |
|
10-03-2007 01:58 PM ET (US)
|
|
PROJECT 1: Is there an easy way to convert the matlab version of the data into a format readable in R?
|
| Charles Elkan
|
8
|
 |
|
10-03-2007 12:13 PM ET (US)
|
|
|
| Charles Elkan
|
7
|
 |
|
10-03-2007 12:12 PM ET (US)
|
|
> that each digit is a 28x28 array of pixels, but when I try to > isolate that size of matrix within the data, I seem to end up > with all zero's in most cases. Conceptually each digit is 28 by 28, but for input to the algorithm each digit is one row of length 784 = 28*28. > Dasgupta's old hmwk for "usefull hints", but embarassingly Sorry, the link should be to Dasgupta's homework 0, not 1: http://www.cs.ucsd.edu/users/dasgupta/250B/hw0.pdfWhat you will find there is information about (a) Matlab's "image" function that lets you view digits and (b) How to compute Euclidean distances efficiently. Charles
|
| Gideon Prior
|
6
|
 |
|
10-03-2007 02:08 AM ET (US)
|
|
Can anyone explain how the k nearest neighbors algorithm works? Thanks
|
| Kristen Jaskie
|
5
|
 |
|
10-03-2007 01:37 AM ET (US)
|
|
I've never used MATLAB before today, but I'm trying to read in the MNIST handwritten digit data we were given for this project. I imported it into Matlab, but when I try and look at it I am having problems understanding how it is presented. You had said that each digit is a 28x28 array of pixels, but when I try to isolate that size of matrix within the data, I seem to end up with all zero's in most cases. I looked at the link to Dasgupta's old hmwk for "usefull hints", but embarassingly enough, I still seem to be pretty lost and don't seem to be making any progress in actually understanding the data or how to use it. Any suggestions or advice would be greatly appreciated!
|
| Jerry Fu
|
4
|
 |
|
10-02-2007 08:27 PM ET (US)
|
|
Is anyone looking for a partner for Project 1?
|
| Charles Elkan
|
3
|
 |
|
10-01-2007 05:44 PM ET (US)
|
|
> In Proj 1 there are a couple of issues that we're to discuss > briefly in the report. Can anyone clarify what exactly "the > effect of balanced versus unbalanced class cardinalities" means? > My basic understanding is that the 0 - 9 digits all have > balanced class cardinalities because there are 6K images for > each. But I have no idea if this is what's being > asked/referenced here.
You are correct. If we tried to distinguish one digit from all the other digits, then we would have unbalanced cardinalities, i.e. 10% versus 90%. You can delay thinking about this issue until the rest of your work is all done. Then, you can do some experiments where you use all 600 images for one digit versus a random sample of (for example) 1200 images of the other digit. Maybe the resulting classifier is biased towards predicting that an ambiguous digit belongs to the first class. Or maybe not?
> Furthermore, can someone clarify the subsequent point about > "whether the training and test sets follow the same probability > distribution"? It would seem that we don't know if the training > and test sets came from the same distribution or not. Also, I'm > not sure what it means to ask if they "follow" the same > probability distribution.
"Follow the same distribution" is just a way of saying that they are random samples from the same distribution.
Look at the original documentation for the data to find out exactly how the training and test sets were created. Are they both random samples of digits written by the same people, or by different people?
|
| Matt Jacobsen
|
2
|
 |
|
10-01-2007 05:09 PM ET (US)
|
|
In Proj 1 there are a couple of issues that we're to discuss briefly in the report. Can anyone clarify what exactly "the effect of balanced versus unbalanced class cardinalities" means? My basic understanding is that the 0 - 9 digits all have balanced class cardinalities because there are 6K images for each. But I have no idea if this is what's being asked/referenced here.
Furthermore, can someone clarify the subsequent point about "whether the training and test sets follow the same probability distribution"? It would seem that we don't know if the training and test sets came from the same distribution or not. Also, I'm not sure what it means to ask if they "follow" the same probability distribution.
Thanks.
|
Charles Elkan
|
1
|
 |
|
09-27-2007 07:47 PM ET (US)
|
|
Please ask all questions of general interest about CSE 250B here. In particular, please ask questions about the lectures and about the projects.
|