| Who | When |
Messages | |
|
|
|
| Kristen Jaskie
|
144
|
 |
|
12-11-2007 11:45 PM ET (US)
|
|
My partner and I have been working super hard to finish this project early because we both have other deadlines and exams this week. We finished the code (using the Viterbi algorithm) this weekend but have spent the two days since trying to run the experiments. Our problem is that we keep running out of memory (even on my partner's brand new laptop optimized for performance with enhanced virtual memory in the control panel). We tried cutting down our number of feature functions to a minimum and performing garbage collection wherever possible. But while we can run the code just fine if we use 1% of the data for training, attempting to use even 5% causes the program to fail due to lack of memory. We have spent WAY too much time on this problem and are unsure how to proceed. We obveously cannot do 5-way crossvalidation. Should we just attempt 100-way crossvalidation (using 1% at a time to train and the rest to test?) We started early but we're down to the nub now and don't know what to do. Any suggestions?
|
| Sean Park
|
145
|
 |
|
12-11-2007 11:51 PM ET (US)
|
|
Same thing is happening to us. Do not really know what to do as well.
|
| Jerry Fu
|
146
|
 |
|
12-12-2007 01:17 AM ET (US)
|
|
how many feature functions are you using?
|
| Kristen Jaskie
|
147
|
 |
|
12-12-2007 01:20 AM ET (US)
|
|
We were using over a thousand but we cut out the ones that didn't seem very useful during preliminary testing so now we are only using 740.
|
| Charles Elkan
|
148
|
 |
|
12-12-2007 01:24 AM ET (US)
|
|
Edited by author 12-12-2007 01:25 AM
/m144 answer: I am surprised because the Viterbi algorithm only needs O(m*n) space where m is the number of tags and n is the length of the word. With m <= 4 and n <= 10 this should not be a problem! If you can run ok with training based on 100 words but not 500, it seems you are not freeing up memory after each processing each word. Probably you need to find and fix this memory leak. One way to clean up your code to reduce memory usage is to encapsulate the body of an outer loop into a function. This can help because all memory used by local variables of the function is released every time you return from the function. Have you used the Matlab profiler to see where your code is slow? That may be where the memory leak is. Unfortunately the profiler does not directly tell you about memory usage. You may find useful hints here: http://www.mathworks.com/support/tech-notes/1100/1106.html
|
| Charles Elkan
|
149
|
 |
|
12-12-2007 01:30 AM ET (US)
|
|
What would be an "acceptable" number of feature functions to have? I'm asking because I'm not sure we will have time to put in thousands or even hundreds of functions for the final project. We're currently working with a small number of feature functions as we develop and test our algorithms, but we expect to expand beyond those that we currently have, although it might be difficult for us to get more than 50 or so. Another concern here is how well the project would work without that many feature functions, and I guess I'll find that out once we're able to start testing this.
I think for good performance you need at least a few hundred feature functions. However, I will be happy if you show that I am wrong!
This project is an educational experience, so if you do everything well except for some reason you have under 100 feature functions, you will still get a very good score. In your report, do explain briefly why you have not many feature functions, and what the bottleneck(s) is/are preventing you from having more.
If you do have 100s or 1000s of feature functions, then experiments using different subsets of these will be very interesting.
|
| Kristen Jaskie
|
150
|
 |
|
12-12-2007 01:31 AM ET (US)
|
|
The Viterbi algorithm only needs O(m*n) space per word, but for time complexity reasons, we were computing the Viterbi algorithm for all the words together and storing it in a matrix of size O(m*n*numwords). Also, it's necessary to compute g which is of size(numwords*lengthOfLongestWord*4) and to do that, we have to compute four more f matrices (one for 00, 01, 10, 11) which are each (numwords*lengthOfLongestWord*numFeatures) in size. All together there are ton of data structures that have to be stored even to compute U. Unless we are totally confused and doing something wrong...
|
| Charles Elkan
|
151
|
 |
|
12-12-2007 01:38 AM ET (US)
|
|
for time complexity reasons, we were computing the Viterbi algorithm for all the words together and storing it in a matrix of size O(m*n*numwords).
If you are using the perceptron method (or stochastic gradient) this is not a good idea, because the weight vector changes after processing each word. So you have to run Viterbi each word at a time.
we have to compute four more f matrices (one for 00, 01, 10, 11) which are each (numwords*lengthOfLongestWord*numFeatures) in size. These matrices can be stored as 'sparse' in Matlab, because most feature-functions are 0, for every position of every word.
|
| Klinton Bicknell
|
152
|
 |
|
12-12-2007 01:50 AM ET (US)
|
|
I don't understand the question about tuning numerical thresholds for predictions. I'm not sure what part of Viterbi could have a threshold.
|
| Charles Elkan
|
153
|
 |
|
12-12-2007 02:25 AM ET (US)
|
|
/m152: You are right, there are no thresholds in the Viterbi algorithm. Generally you might find some way of tuning the training and/or prediction algorithm to produce overall more or fewer hyphens. But I can't say I know a good way to do this. More generally, per-letter accuracy, per-word accuracy, and log-likelihood are all different objective functions. A research question is how to tune CRF-type methods in order to maximize (on test data!) whatever objectove makes most sense for the application.
|
Jerry Fu
|
154
|
 |
|
12-12-2007 03:32 AM ET (US)
|
|
I am a bit confused now. Aren't the weights set before you ever run Viterbi? so you would go through the entire dataset using perceptron to train weights, then use Viterbi to determine the labellings?
|
| Charles Elkan
|
155
|
 |
|
12-12-2007 09:57 AM ET (US)
|
|
/m154: Yes, Viterbi assumes the weights are given. However, these weights are updated by each step of the perceptron. In each epoch through the training data, when you run Viterbi on example i+1, you need to use the weights updated based on example i. Running Viterbi on every example only at the beginning of each epoch might conceivably work. This would be an interesting experiment to try. To get a useful answer, you would have to adjust the learning rate separately for each method.
|
| Klinton Bicknell
|
156
|
 |
|
12-13-2007 12:41 PM ET (US)
|
|
At the beginning of the quarter, you said in the response to one of the messages on this board that R and Matlab had complementary advantages, and it was good to know both. Yet, I think all the projects we've done have strongly recommended using matlab. I was just curious what you saw as the relative strengths and weaknesses of the two. (Obviously, this is a low-priority question, though =)
|
| Charles Elkan
|
157
|
 |
|
12-13-2007 12:59 PM ET (US)
|
|
I did not intend to advocate Matlab strongly over R. Personally I happen to be more accustomed to Matlab. AFAIK Matlab has a wider range of users and of libraries for all of science and engineering, including machine learning. On the ther hand R is more elegant and more widely used by the best researchers in statistics. I'm sure there are other important differences also.
One important question is which handles large datasets better. I don't know the answer to that.
On Thu, 13 Dec 2007, QT - Klinton Bicknell wrote:
> < replied-to message removed by QT >
|
Jerry Fu
|
158
|
 |
|
12-14-2007 06:57 AM ET (US)
|
|
I don't know if anyone is still reading this, but I discovered an interesting "trick" in MATLAB. I'm sure most of you found that computing the feature function values took quite a while. So I decided to try splitting up the training set, so that I calculated all the features for 500 examples at a time (passing only the 500 examples into the function), and just appended the calculated arrays. This was blazing fast compared to calculating the entire set at once...so that's something to try if you have this problem in the future.
|
| Sean Park
|
159
|
 |
|
12-17-2007 04:23 AM ET (US)
|
|
Thanks Jerry
|