QuickTopic (SM) free message boards QuickTopic (SM) free message boards
Skip to Messages
  Sign In to access your topic list  |New Topic |My Topics|Profile
Upgrade to Pro   Customize, show pictures, add an intro, and more:   QuickTopic Pro...and check out QuickThreadSM
Topic: CSE 250B Fall 2007
Views: 3058, Unique: 550 
Subscribers: 9
What's
this?
Printer-Friendly Page
Subscribe to get & post, or stop messages by email Subscribe
About these ads
Who | When
Messagessort recent-bottom   
Post a new message
 
Sean Park  159
12-17-2007 04:23 AM ET (US)
Thanks Jerry
Jerry FuPerson was signed in when posted  158
12-14-2007 06:57 AM ET (US)
I don't know if anyone is still reading this, but I discovered an interesting "trick" in MATLAB. I'm sure most of you found that computing the feature function values took quite a while. So I decided to try splitting up the training set, so that I calculated all the features for 500 examples at a time (passing only the 500 examples into the function), and just appended the calculated arrays. This was blazing fast compared to calculating the entire set at once...so that's something to try if you have this problem in the future.
Charles Elkan  157
12-13-2007 12:59 PM ET (US)
I did not intend to advocate Matlab strongly over R. Personally I happen to be more accustomed to Matlab. AFAIK Matlab has a wider range of users and of libraries for all of science and engineering, including machine learning. On the ther hand R is more elegant and more widely used by the best researchers in statistics. I'm sure there are other important differences also.

One important question is which handles large datasets better. I don't know the answer to that.



On Thu, 13 Dec 2007, QT - Klinton Bicknell wrote:

>
< replied-to message removed by QT >
Klinton Bicknell  156
12-13-2007 12:41 PM ET (US)
At the beginning of the quarter, you said in the response to one of the messages on this board that R and Matlab had complementary advantages, and it was good to know both. Yet, I think all the projects we've done have strongly recommended using matlab. I was just curious what you saw as the relative strengths and weaknesses of the two. (Obviously, this is a low-priority question, though =)
Charles Elkan  155
12-12-2007 09:57 AM ET (US)
/m154: Yes, Viterbi assumes the weights are given. However, these weights are updated by each step of the perceptron. In each epoch through the training data, when you run Viterbi on example i+1, you need to use the weights updated based on example i.

Running Viterbi on every example only at the beginning of each epoch might conceivably work. This would be an interesting experiment to try. To get a useful answer, you would have to adjust the learning rate separately for each method.
Jerry FuPerson was signed in when posted  154
12-12-2007 03:32 AM ET (US)
I am a bit confused now. Aren't the weights set before you ever run Viterbi? so you would go through the entire dataset using perceptron to train weights, then use Viterbi to determine the labellings?
Charles Elkan  153
12-12-2007 02:25 AM ET (US)
/m152: You are right, there are no thresholds in the Viterbi algorithm.

Generally you might find some way of tuning the training and/or prediction algorithm to produce overall more or fewer hyphens. But I can't say I know a good way to do this.

More generally, per-letter accuracy, per-word accuracy, and log-likelihood are all different objective functions. A research question is how to tune CRF-type methods in order to maximize (on test data!) whatever objectove makes most sense for the application.
Klinton Bicknell  152
12-12-2007 01:50 AM ET (US)
I don't understand the question about tuning numerical thresholds for predictions. I'm not sure what part of Viterbi could have a threshold.
Charles Elkan  151
12-12-2007 01:38 AM ET (US)
for time complexity reasons, we were computing the Viterbi algorithm for all the words together and storing it in a matrix of size O(m*n*numwords).

If you are using the perceptron method (or stochastic gradient) this is not a good idea, because the weight vector changes after processing each word. So you have to run Viterbi each word at a time.

we have to compute four more f matrices (one for 00, 01, 10, 11) which are each (numwords*lengthOfLongestWord*numFeatures) in size.
These matrices can be stored as 'sparse' in Matlab, because most feature-functions are 0, for every position of every word.
Kristen Jaskie  150
12-12-2007 01:31 AM ET (US)
The Viterbi algorithm only needs O(m*n) space per word, but for time complexity reasons, we were computing the Viterbi algorithm for all the words together and storing it in a matrix of size O(m*n*numwords). Also, it's necessary to compute g which is of size(numwords*lengthOfLongestWord*4) and to do that, we have to compute four more f matrices (one for 00, 01, 10, 11) which are each (numwords*lengthOfLongestWord*numFeatures) in size. All together there are ton of data structures that have to be stored even to compute U. Unless we are totally confused and doing something wrong...
Charles Elkan  149
12-12-2007 01:30 AM ET (US)
What would be an "acceptable" number of feature functions to have?
I'm asking because I'm not sure we will have time to put in thousands
or even hundreds of functions for the final project. We're currently
working with a small number of feature functions as we develop and
test our algorithms, but we expect to expand beyond those that we
currently have, although it might be difficult for us to get more than
50 or so. Another concern here is how well the project would work
without that many feature functions, and I guess I'll find that out
once we're able to start testing this.


I think for good performance you need at least a few hundred feature functions. However, I will be happy if you show that I am wrong!

This project is an educational experience, so if you do everything well except for some reason you have under 100 feature functions, you will still get a very good score. In your report, do explain briefly why you have not many feature functions, and what the bottleneck(s) is/are preventing you from having more.

If you do have 100s or 1000s of feature functions, then experiments using different subsets of these will be very interesting.
Charles Elkan  148
12-12-2007 01:24 AM ET (US)
Edited by author 12-12-2007 01:25 AM
/m144 answer: I am surprised because the Viterbi algorithm only needs O(m*n) space where m is the number of tags and n is the length of the word. With m <= 4 and n <= 10 this should not be a problem!

If you can run ok with training based on 100 words but not 500, it seems you are not freeing up memory after each processing each word. Probably you need to find and fix this memory leak.

One way to clean up your code to reduce memory usage is to encapsulate the body of an outer loop into a function. This can help because all memory used by local variables of the function is released every time you return from the function.

Have you used the Matlab profiler to see where your code is slow? That may be where the memory leak is. Unfortunately the profiler does not directly tell you about memory usage.

You may find useful hints here: http://www.mathworks.com/support/tech-notes/1100/1106.html
Kristen Jaskie  147
12-12-2007 01:20 AM ET (US)
We were using over a thousand but we cut out the ones that didn't seem very useful during preliminary testing so now we are only using 740.
Jerry Fu  146
12-12-2007 01:17 AM ET (US)
how many feature functions are you using?
Sean Park  145
12-11-2007 11:51 PM ET (US)
Same thing is happening to us.
Do not really know what to do as well.
Kristen Jaskie  144
12-11-2007 11:45 PM ET (US)
My partner and I have been working super hard to finish this project early because we both have other deadlines and exams this week. We finished the code (using the Viterbi algorithm) this weekend but have spent the two days since trying to run the experiments. Our problem is that we keep running out of memory (even on my partner's brand new laptop optimized for performance with enhanced virtual memory in the control panel). We tried cutting down our number of feature functions to a minimum and performing garbage collection wherever possible. But while we can run the code just fine if we use 1% of the data for training, attempting to use even 5% causes the program to fail due to lack of memory. We have spent WAY too much time on this problem and are unsure how to proceed. We obveously cannot do 5-way crossvalidation. Should we just attempt 100-way crossvalidation (using 1% at a time to train and the rest to test?) We started early but we're down to the nub now and don't know what to do. Any suggestions?
Charles Elkan  143
12-11-2007 03:11 PM ET (US)
Some additional notes on what to put in your report

Make sure you yourself, and the reader, are convinced that your algorithms and your implementation are correct. We want to be able to believe and trust your experimental results.

In the report, make it very clear exactly what feature-functins you use. For each different type of feature-function, explain it explicitly and say how many feature-functions there are of this type.

Examine and discuss your final trained CRF model. Tabulate which feature-functions have the largest learned weights (positive or negative). Use your intuition to explain, in retrospect, why these feature-functions have the most predictive power.
Charles Elkan  142
12-11-2007 12:15 AM ET (US)
a feature function that examines only vowel consonant patterns and hyphenates a particular letter in that sequence does not change based on the previous tag. so when performing the Viterbi algorithm, the value of this feature function would be multiplied by its weight and added in , correct?

Yes.

If we have a feature function that examines the previous tag, we can precalculate all possible tag combinations in preprocessing, and then operate on these vectors instead of calculating these on the fly.
Yes.
Jerry Fu  141
12-10-2007 11:30 PM ET (US)
I have a questions about feature functions, specifically how the lecture notes example pertains to our project. the function U(k,v) maximizes over the tags y at each step, which are either 1 or 0 in our project. However, a feature function that examines only vowel consonant patterns and hyphenates a particular letter in that sequence does not change based on the previous tag. so when performing the Viterbi algorithm, the value of this feature function would be multiplied by its weight and added in , correct?

If we have a feature function that examines the previous tag, we can precalculate all possible tag combinations in preprocessing, and then operate on these vectors instead of calculating these on the fly.
Charles Elkan  140
12-09-2007 05:25 PM ET (US)
Reminder: The final exam will be on Thursday at 3pm, in WLH 2208, which is the same room that lectures were in.
Charles Elkan  139
12-09-2007 12:05 PM ET (US)
Project simplifications

Here are some suggestions and simplifications that should make the project easier.

(1) Implement perceptron training first. With this method, you do not need to compute the partition function, and you do not need to compute derivatives. All you need is the argmax.

(2) If you assume that no hyphen is allowed after the first letter or before the last letter, then you don't need START and STOP tags, so the cardinality of the tag set is just 2. With this cardinality, you can compute the argmax, Z, and derivatives all by brute force (i.e. enumerating all possible tag sequences). Implement this first and use it to do the project. Then, if you have time, use it to check the correctness of your code that uses sophisticated algorithms.

(3) Your priorities should be clearness, correctness, and usefulness. A clear, correct, useful method is far more desirable than a fancy method whose correctness is not clear and whose results are questionable.

(4) There is an 80-20 rule in everything. Best performance might come from 10,000 feature-functions but you can get almost as good performance with many fewer. Start with few enough so that your code is fast enough that you can debug it easily.

(5) Perception is reality. First impressions matter. Presentation counts. A good report about a good method is better than a bad report about a method that might be very good.

(6) Many problems are too hard to solve, except with simplifications. Having the judgment to invent potential simplifications, and then to select reasonable ones, is a very important skill.

In summary, I will be happy to give full marks to a good report about a method that clearly does solve the problem, even if the method itself is simplified. Specifically, you can get full marks based on just just one training method.
Charles Elkan  138
12-08-2007 06:33 PM ET (US)
I appreciate that the last project is the most complex one, and that many people have multiple other assignments due also. I will certainly be generous in scoring. Let me repeat also that scoring is not grading, so lowish scores can still lead to a good grade. The projects and exams are challenging!

I have been covering log-linear models and CRFs since November 27. I understand that many people may not have been able to focus on the project until now. I will try to be very available to answer questions on this message board.
Jostein Oeysaed  137
12-08-2007 04:57 PM ET (US)
Hi Charles,

I'm feeling kind of frustrated because this last project seems much more difficult than the previous ones, and you didn't go over the stuff we needed to know until Thursday. My partner and I started early and were trying to do it before, but haven't made any progress till now since you hadn't gone over what we needed to know. We still have a LONG way to go (it's pretty confusing stuff) and now we only have a couple of days to do the entire thing in. The problem is that I have several other finals and projects next week, so I just don't know if we'll have time to do a good job AND to study for my other classes.

I think the week we had off due to the fires messed up your schedule, but I don't think it's fair for you to make us work like this during finals week. We really didn't have the material that we needed to do the project until the very last day of class.

It would help if you could be a little bit easier in grading this project, due to the lack of time we have to do it (like how you "graded" the first project :)
Charles Elkan  136
12-07-2007 01:38 AM ET (US)
I have updated the online lecture notes on CRFs, which are here: http://www-cse.ucsd.edu/users/elkan/250B/loglinear.pdf

Also, for tutorials on CRFs see
http://www-cse.ucsd.edu/~elkan/254/papers.html
Charles Elkan  135
12-05-2007 10:24 PM ET (US)
Edited by author 12-05-2007 10:25 PM
how to represent feature-functions in matlab

Given fixed xbar and ybar, each F_i value is a number. So you can represent all F_i as one vector for xbar and ybar. This vector is sparse since F_i(xbar,ybar) = 0 for most i. The F_i vector can be computed once and for all for each training example, before you start training.

How you compute the F_i vectors, i.e. how you go from definitions like f(y_i-1, y_i, xbar, i) = I(y_i = 1)*I(<x_i-1 x_i x_i+1 x_i+1> = vowel-consonant-vowel-consonant) to the numerical values, is a separate question. You may want to do this preprocessing in a scripting language.
Charles Elkan  134
12-05-2007 10:12 PM ET (US)

m is "... the cardinality of the set of labels." So, for this project it should be 2 since the each label of a character is binary, correct?. This creates 2x2 matrices for the multiplications

Yes, because there are just two labels: 0 for no hyphen, and 1 for hyphen. If you also have labels "begin" and "end" then m=4.


I don't quite think my understanding of this is correct since for a feature function with different arity like g_t(y_{i-1}, y_i, y_{i+1}), this doesn't work. How are different arity feature functions dealt with, and what is the correct value for m?

The arity has nothing to do with m, the number of different label values possible at a single position. The restriction to arity = 2 is fundamental to the CRFs we are studying. The reason is precisely that the Viterbi and matrix algorithms only apply when arity = 2.
Kristen Jaskie  133
12-05-2007 05:32 PM ET (US)
Deleted by author 12-05-2007 08:19 PM
Stefano  132
12-05-2007 03:14 PM ET (US)
I have a question about computing the partition function.

The matrix multiplications of M_t(v,w), are for n different m-x-m matrices. m is "... the cardinality of the set of labels." So, for this project it should be 2 since the each label of a character is binary, correct?. This creates 2x2 matrices for the multiplications, which seem to work out for the definition as g_t(y_{i-1},y_i).

I don't quite think my understanding of this is correct since for a feature function with different arity like g_t(y_{i-1}, y_i, y_{i+1}), this doesn't work. How are different arity feature functions dealt with, and what is the correct value for m?
Charles Elkan  131
12-04-2007 01:40 PM ET (US)
Edited by author 12-04-2007 01:41 PM
I'm a bit confused about what function to use with Carl Rasmussen's non-linear optimizer. The example shows it being used with the Rosenbrock function which returns the function value, the partial derivatives, and the Hessian of the (general dimension) function. I'm guessing however that we only need to supply it with a function that returns the function value and the partial derivatives, right?

Correct. Use checkgrad.m to verify that the partial derivatives you supply are correct.


Then what might this function be? ... should we write one? Perhaps one that calculates the log likelihood? But what would the partial derivatives be?

Yes, you should supply the log-likelihood as your function to maximize. The partial derivatives are given in my lecture notes on page 7.
Matt Jacobsen  130
12-04-2007 01:57 AM ET (US)
I'm a bit confused about what function to use with Carl Rasmussen's non-linear optimizer. The example shows it being used with the Rosenbrock function which returns the function value, the partial derivatives, and the Hessian of the (general dimension) function. I'm guessing however that we only need to supply it with a function that returns the function value and the partial derivatives, right?

Then what might this function be? Would the Resenbrock function be appropriate? Or should we write one? Perhaps one that calculates the log likelihood? But what would the partial derivatives be?

I'm a bit lost on this. Any help would be appreciated.

Thanks,
Matt
Charles Elkan  129
12-03-2007 10:58 PM ET (US)
<i>
By placing hyphens in novel english words correctly do you mean
breaking the word into its syllables?
<\i>
Generally a hyphen is only allowed between syllables, but some syllable breaks are not desirable hyphen places. For example, you don't want a hyphen after the first letter of "asymptomatic."

The task here is shallow: just to learn to predict hyphen locations as they are found in the training data. Syllabification is a deeper, related, task.
Matt Rodriguez  128
12-03-2007 09:21 PM ET (US)
Quick question about the project

By placing hyphens in novel english words correctly do you mean breaking the word into its syllables?

Thanks,
Matt
Charles Elkan  127
12-03-2007 03:37 PM ET (US)

Are you going to be covering the Viterbi algorithm and matrix
multiplication algorithm in class? The paper is kind of
confusing. In particular, we are having trouble understanding
U(k,v).

Yes, these two algorithms will be the main topic tomorrow, Tuesday.
U(k,v) is an entry in a 2D matrix. k is an integer between 1 and the length of ybar. v is one of the possible values for y_k. Each value U(k,v) is computed based on the values in the previous row,
which is U(k-1,:).

Note the typo. in the first equation on page 5 of the notes:
y_k - 1 should be y_{k-1}.
Kristen Jaskie  126
12-03-2007 02:59 PM ET (US)
Are you going to be covering the Viterbi algorithm and matrix multiplication algorithm in class? The paper is kind of confusing. In particular, we are having trouble understanding U(k,v).
Charles Elkan  125
11-30-2007 04:42 PM ET (US)
Lecture notes on log-linear models and conditional random fields are available now at http://www.cs.ucsd.edu/users/elkan/250B/loglinear.pdf

Questions are welcome here!
Charles Elkan  124
11-27-2007 10:35 AM ET (US)

1) When we tried T = 20 and 10, the process takes only a few iterations to converge (i.e. the euclidean distance between subsequent thetas (or f) array falls below a threshold of 10^-6). Usually, if we follow the 20 or 10 with 8, 6, or 5, the EM runs many iterations before converging (several hundred to 1000). While it's iterating, the E(LL) is always increasing, but the change in theta usually starts out decreasing, but then starts to increase before falling below the threshold of 10^-6. Then suddenly, it'll just fall off. After that, the remaining values of T all converge quickly (< 100 iterations).

I am not surprised by the above. When T is very high (what this is is problem-dependent) then the LL surface is vey smooth and convergence is fast.

We are not supposed to do anything to theta between T values right? we don't have to scale or multiply them by the root that we were taking it at do we.

Correct.

2) the theta that we get frrom deterministic annealing has the same probability distribution at each position across all letters.

This may be because you started with T too high and the LL surface was flat.


Should be be stopping iterations once the change in theta increases for a given value of T? I'm wondering if maybe we let EM run "too far."

No.


UPDATE: I tried T = [2 1], and we seemed to get reasonable values for theta when I let that run to convergence.

Good!
Jerry Fu  123
11-27-2007 02:33 AM ET (US)
Edited by author 11-27-2007 02:41 AM
When we run our EM algorithm with deterministic annealing, we're noticing a strange (possibly) things.

1) When we tried T = 20 and 10, the process takes only a few iterations to converge (i.e. the euclidean distance between subsequent thetas (or f) array falls below a threshold of 10^-6). Usually, if we follow the 20 or 10 with 8, 6, or 5, the EM runs many iterations before converging (several hundred to 1000). While it's iterating, the E(LL) is always increasing, but the change in theta usually starts out decreasing, but then starts to increase before falling below the threshold of 10^-6. Then suddenly, it'll just fall off. After that, the remaining values of T all converge quickly (< 100 iterations).

We are not supposed to do anything to theta between T values right? we don't have to scale or multiply them by the root that we were taking it at do we.

2) the theta that we get frrom deterministic annealing has the same probability distribution at each position across all letters.

Should be be stopping iterations once the change in theta increases for a given value of T? I'm wondering if maybe we let EM run "too far."

UPDATE: I tried T = [2 1], and we seemed to get reasonable values for theta when I let that run to convergence.
Charles Elkan  122
11-26-2007 03:42 PM ET (US)

How many local optima can there be in terms of N number of training examples, W length of subsequence, and L number of letters?


I'm pretty sure the answer is an exponential number. It has been proved that with a certain formalization of the task, finding the globally optimal motif is NP-hard.
Charles Elkan  121
11-26-2007 03:40 PM ET (US)
Edited by author 11-26-2007 03:40 PM

what does it mean to get Z_i1 > 0.50 for a particular protein?


Z values are not defined for whole proteins. They are just defined for subsequences. Z(protein i, position j) > 0.5 means over 50% chance that *this* subsequence is a motif instance.


Lastly, the project description says that we should implement a streamlined version of MEME that fixes any mathematical errors. I've been busy implementing what is described, although I've left out some thing like erasing that we don't need. But does anyone have suggestions or hints on what needs to be fixed with the original implementation?

"Streamlined" means only one of OOPS, ZOOPS, etc., no multiple motifs, no automatic width determination, etc. "fixes any mathematical errors" just means beware that some notation or other details may be wrong in the original paper, just as it may be wrong in any published paper. Our (my) understanding of EM is better now than it was in 1996 (but still imperfect!).
Charles Elkan  120
11-26-2007 03:34 PM ET (US)

Are we supposed to normalize the Z_i1 vector for all subsequences of length W such that the sum over each subsequence is <= 1? This is described as the z_ij matrix (small z matrix) in the paper.


You do not need to do this normalization. It is important in practice for DNA motifs. It is less important for protein motifs. You do not need to do it now partly because it is just a heuristic. Some recent research has provided an alternative approach to avoiding motifs that self-overlap that is more mathematically justified, but that is far outside the scope of this project.
Charles Elkan  119
11-26-2007 03:31 PM ET (US)
Edited by author 11-26-2007 03:31 PM

The log-likelihood is slightly worse, at the end, compared to not using DA. Using DA and MEME's theta1, value is -65223. Not using DA, but using MEME's theta1, value is -65072. Using our own initial theta1 with and without DA, our values are the same, -65015.

There is no guarantee that any particular DA schedule is useful for any particular task, so these results don't necessarily need an explanation.


When starting with our initial theta1, the log-likelihood values are always increasing with every iteration. However, when starting with MEME's theta1 this is not true. During the t=10 stage, the first 6 iterations actually decrease the log-likelihood (approaching -65223). If I didn't see that our own theta1 + DA experiment had always increased log-likelihood, I'd expect this could happen because of the soft assignments.

I think it is possible that the extra-soft assignments used by DA may cause the log-likelihood not to increase at every iteration. This phenomenon could show up for some uses of DA but not others.


But now I'm very confused. Why would log-likelihood decrease when run using MEME's theta1?

DA is one candidate explanation. Another may be that standard EM is only guaranteed to converge to a saddle point, not necessarily to a local maximum. Unfortunately I don't know any good readable full analysis of EM's convergence properties. This is frustrating because without a full theoretical analysis, it's hard to know whether the code might have a bug.
Matt Rodriguez  118
11-26-2007 03:03 PM ET (US)
How many local optima can there be in terms of N number of training examples,W length of subsequence, and L number of letters? Can this be determined?

I know in a polynomial of root n there can be n -1 local optima, but this is a lot more complicated than that.

Cheers
Jerry Fu  117
11-26-2007 02:21 PM ET (US)
This was mentioned earlier in the board, but what does it mean to get Z_i1 > 0.50 for a particular protein? Does it mean that some subsequence of length W from that protein had a value > 0.5?

Also, do we need to look at value from the Z matrix? I suppose it is one way of figuring out which subsequence belong to the motif distribution, but this information is factored into theta already (except that theta only gives a probability distribution of letters and positions).

Speaking of theta, I know people have said that the values are not as extreme as MEME, but in my theta, the highest values that I am getting are about 0.2 (for any given letter in a given position)...that doesn't seem very extreme at all.

Lastly, the project description says that we should implement a streamlined version of MEME that fixes any mathematical errors. I've been busy implementing what is described, although I've left out some thing like erasing that we don't need. But does anyone have suggestions or hints on what needs to be fixed with the original implementation? The one thing I see so far changing is not using the z_ij as described in the paper, because it ends up being pretty sparse, and just working with a vector of the subsequences. This doesn't really streamline too much though, besides not needing to create an extra matrix.
Jerry Fu  116
11-26-2007 02:01 PM ET (US)
"Are we supposed to normalize the Z_i1 vector for all subsequences of length W such that the sum over each subsequence is <= 1? This is described as the z_ij matrix (small z matrix) in the paper. Are we supposed to use this z_ij matrix (small z) during the M-step instead of the Z_ij (big Z) matrix for j=1,2 and i=1..n?

In the paper this is done to prevent the EM from converging to simple motifs, but this then breaks the property making each a probability, Z_i1 + Z_i2 = 1 (unless we're supposed to re-scale Z_i2 as well?) Or is there some other way to prevent convergence to simple motifs?

I had thought we did it correctly, until I re-read this part. Could someone clarify? Thanks!"

I think we are supposed to normalize it. But one catch is that the way this is described in the paper, it only normalizes over zij that are on the same line (i.e. from the same protein). My program has all the subsequences of length W stored in a vector like you mentioned (referred to as X in the paper), but I also create the matrix z_ij as described in the paper using the vector X and the lengths of each protein in the dataset Y. The thing is, I use z_ij only to normalize...just about everything else is performed using X, so I think I could probably write a loop to normalize X directly. I think the loops would have to use the lengths though, because in the array X you can't tell when you've wrapped around to a new line of z_ij (i.e. a new protein).

As for Zi2, I though we can get away without normalizing it, because I thought it is only used only for the recalculating f0...but I just noticed it is also being used for lambda. so maybe we do need to normalize that one also.
Matt Jacobsen  115
11-26-2007 12:29 PM ET (US)
<i.More alarmingly, when I start with MEME's "letter-probability matrix" as my initial theta1, but run EM with a 10,1 deterministic annealing T schedule, I get a Z_ij matrix that attributes absolutely all subsequences to the background model (i.e. all Z_i1 are 0 and Z_i2 are 1). This is even more confusing.

Yes, this is strange. Is the log-likelihood increasing at every iteration; Is the final log-likelihood better than without DA?</i>

The log-likelihood is slightly worse, at the end, compared to not using DA. Using DA and MEME's theta1, value is -65223. Not using DA, but using MEME's theta1, value is -65072. Using our own initial theta1 with and without DA, our values are the same, -65015.

When starting with our initial theta1, the log-likelihood values are always increasing with every iteration. However, when starting with MEME's theta1 this is not true. During the t=10 stage, the first 6 iterations actually decrease the log-likelihood (approaching -65223). If I didn't see that our own theta1 + DA experiment had always increased log-likelihood, I'd expect this could happen because of the soft assignments. But now I'm very confused. Why would log-likelihood decrease when run using MEME's theta1?

Thanks,
Matt
Charles Elkan  114
11-26-2007 02:05 AM ET (US)

Using the "letter-probability matrix" as the initial theta1 (along with all my other initial parameters) and no annealing, I converge to the same results as MEME in just 2 iterations. When I calculate the sum log likelihood after this convergence, I get a value of -65223, whereas my not-so-crisp results produce a sum log likelihood of -65015 consistently.

Yes, what you are doing is the best way of finding the log-likelihood of the MEME output. It seems your model has better log-like. than the MEME output. A plausible explanation is that MEME is tuned to find crisp motifs; maximizing likelihood is not the perfect objective from a biological perspective.


More alarmingly, when I start with MEME's "letter-probability matrix" as my initial theta1, but run EM with a 10,1 deterministic annealing T schedule, I get a Z_ij matrix that attributes absolutely all subsequences to the background model (i.e. all Z_i1 are 0 and Z_i2 are 1). This is even more confusing.

Yes, this is strange. Is the log-likelihood increasing at every iteration; Is the final log-likelihood better than without DA?

It is possible for EM-style methods to converge to a model where some components are unused, like you are finding with DA. In this case you have a bad local optimum and you need to jump out of it somehow.

In general, note that none of the following are guaranteed:
DA will find the global optimum.
MEME finds the global optimum.
The global optimum is the optimum from the end-user perspective.
The 10,1 DA schedue is effective for any particular task.
Matt Jacobsen  113
11-26-2007 01:26 AM ET (US)
I don't quite understand how to calculate the the likelihood from the MEME output. In my code, I calculate the log likelihood for each subsequence and sum it together. But this requires my final theta1, theta2, lambda1, lambda2, and Z_ij matrix. I don't get this from the MEME output. I only get (as far as I can see) the theta1 matrix, called the "letter-probability matrix". Am I missing something from the MEME output? Is there a better way of calculating the likelihood using the MEME output?

Using the "letter-probability matrix" as the initial theta1 (along with all my other initial parameters) and no annealing, I converge to the same results as MEME in just 2 iterations. When I calculate the sum log likelihood after this convergence, I get a value of -65223, whereas my not-so-crisp results produce a sum log likelihood of -65015 consistently. I would expect the log likelihood to be some kind of indicator of how well the Z_ij, theta, and lambda categorize the subsequence examples. So how can this be?

More alarmingly, when I start with MEME's "letter-probability matrix" as my initial theta1, but run EM with a 10,1 deterministic annealing T schedule, I get a Z_ij matrix that attributes absolutely all subsequences to the background model (i.e. all Z_i1 are 0 and Z_i2 are 1). This is even more confusing. Shouldn't the deterministic annealing approach find the global optimum (especially if it starts with a theta1 that's the result of finding the global optimum)?

Thanks,
Matt
Charles Elkan  112
11-26-2007 12:04 AM ET (US)

if I get a single subsequence with Z_i1 > 0.5 for a particular protein, is it the case that it has at least 1 motif? Because it seems like I'm getting at least 1 subsequence instance that's attributed > 50% to the motif component for all proteins I've tested. So I'm not encountering any 0 motif examples when I believe MEME is for the same protein in the same test.


It sounds like you are finding a motif that "hits" very many subsequences. This is consistent with getting a motif model that is not crisp, i.e. does not have very high probabilities for any aa in any position, as described in /m107. So my guess is that this is another aspect of the problem of not finding a good local optimum.
Charles Elkan  111
11-25-2007 11:59 PM ET (US)

MEME converges to more extreme values than our model. Our model has less extreme values and the values stay constant over position. This happens for letters that have a higher background probability. Has anybody else seen this? We know we are running our model to convergence, but why does it not converge as sharply as MEME does?


My guess is that your code is converging to a bad local optimum. You could try starting with the MEME output. You should find (a) this has better likelihood and (b) your code then converges to something with even better likelihood.

MEME does have a successful domain-specific way of finding god local optimum. The idea is as follows:

(1) Use every subsequence as an initial guess; run EM for just a few iterations from it.
(2) For just the top few subsequences from (1), run EM to convergence.
Matt Jacobsen  110
11-25-2007 09:44 PM ET (US)
I'm still a bit confused about how we should select the best subsequence representing a motif so as to compare with MEME's subsequence output. I believe it was posted earlier that we can find these subsequences by selecting the ones that have a higher probability of being from the motif component (i.e. Z_i1 > 0.5). But this selects many subsequences (first problem). I can look at all of them to see if one matches the one MEME selects as its "motif 1 site" but that seems rather unscientific.

I though I could select the subsequence with the largest Z_i1 probability but now that I'm using deterministic annealing, there are many now with 1.0 values (second problem).

This then begs the question again, if I get a single subsequence with Z_i1 > 0.5 for a particular protein, is it the case that it has at least 1 motif? Because it seems like I'm getting at least 1 subsequence instance that's attributed > 50% to the motif component for all proteins I've tested. So I'm not encountering any 0 motif examples when I believe MEME is for the same protein in the same test.

Thanks,
Matt
Stefano  109
11-25-2007 09:35 PM ET (US)
Hi Matt,

Yes, I've noticed that specifically the theta1 matrix does not have as large (extreme) probabilities for the "most-likely" proteins for each position.

I thought it had something to do with my incorrect interpretation (and implementation) of the normalization of the Z_i1 vector. But, when I looked at the theta1 probability matrix, the corresponding motif was not trivial (well, at least not a single protein or 2 repeated proteins) using the Z_i1 for the M-step.

So, the largest probabilities in our theta1 are noticeably larger than the rest for a particular position in the subsequence, but not as large as the ones MEME creates, such as 0.93 or 0.96, etc...

Our likelihood improves each iteration as well, which is why I think its ok...

I'm going to venture a guess and say the initial starting values for the theta1 and theta2 are too far away from the desired ones and we've reached a local max. I've tried some different (simple) initialization methods, but so far it hasn't cause the theta values to be more extreme. This is for EM not the DA.

Sorry I couldn't answer your question, but it seems to be happening to us as well.
Stefano  108
11-25-2007 09:20 PM ET (US)
Are we supposed to normalize the Z_i1 vector for all subsequences of length W such that the sum over each subsequence is <= 1? This is described as the z_ij matrix (small z matrix) in the paper. Are we supposed to use this z_ij matrix (small z) during the M-step instead of the Z_ij (big Z) matrix for j=1,2 and i=1..n?

In the paper this is done to prevent the EM from converging to simple motifs, but this then breaks the property making each a probability, Z_i1 + Z_i2 = 1 (unless we're supposed to re-scale Z_i2 as well?) Or is there some other way to prevent convergence to simple motifs?

I had thought we did it correctly, until I re-read this part. Could someone clarify? Thanks!
Matt Rodriguez  107
11-25-2007 09:14 PM ET (US)
Hi everybody

My partner and I believe that we are running the EM correctly because the likelihood improves at each iteration. However we have noticed that our theta1 model parameter does not have as extreme values as the theta1 in MEME.

We've generated a simplified position probability matrix by multiplying each value by 10 and rounding it and we get something like this:

A 1:122:1::1::11111111
C ::::::::::::::::::::
D :111:::::::::::::1:1
E 1:::::::::1:1::1::::
F 1:::::::11::1::::1::
G 111111212221111121:2
H ::::::::::::::::::::
I 22111:1121:22:11:11:
K :11:1:1::::1::1::::1
L :111:::1::::1::1::1:
M 1::1:1:::::::::1::1:
N ::1111:1:111:111:1::
P :::::1::1::1:11::1:1
Q :::::::::::::11:::::
R 1111:1:1:1:::1::::::
S 1:::11:11:11:11:2:11
T :1:11:11:1::11::1:11
V :::1:121221111211211
W ::::::::::::::::::::
Y ::::::::::1:1:11:11:
X ::::::::::::::::::::
background model
A 1.096481e-001
C 6.018062e-003
D 4.623882e-002
E 6.088052e-002
F 2.949489e-002
G 9.469746e-002
H 1.283185e-002
I 5.853482e-002
K 5.132334e-002
L 9.456777e-002
M 2.871076e-002
N 4.549275e-002
P 3.586063e-002
Q 2.397957e-002
R 4.409827e-002
S 6.312976e-002
T 6.590536e-002
V 9.949338e-002
W 8.888773e-003
Y 2.020036e-002
X 4.714461e-006

Now when we run MEME the simplified position probability matrix is this
Simplified A :::3::::::3:::::::::
pos.-specific C ::::::::::::::::::::
probability D ::3:::::::::3::::5::
matrix E ::::::::::::::::::::
                  F ::::::::::::::::::::
                  G 53::::::3:::53:::::3
                  H ::3:::::::5:::::::::
                  I 3:::a::::3::::3:::::
                  K ::33::::::::::::::::
                  L :::::::::::::::3::::
                  M :::::::::::::3::::a3
                  N :::3:::8::3::::3::::
                  P :3:::::::::a:::::5::
                  Q 3:::::::::::::::::::
                  R :3:3:a::::::::::::::
                  S ::::::::8:::::::::::
                  T :3::::::::::3:3:a::5
                  V ::::::a3:8:::55:::::
                  W ::::::::::::::::::::
                  Y ::3::::::::::::5::::


There are at least 2 things to note. MEME converges to more extreme values than our model. Our model has less extreme values and the values stay constant over position. This happens for letters that have a higher background probability. Has anybody else seen this? We know we are running our model to convergence, but why does it not converge as sharply as MEME does?

Thanks,
Matt
Charles Elkan  106
11-24-2007 11:27 PM ET (US)
"For Zij(0), the part specifically that I don't get is the
P(Xi|theta_j). Are we supposed to use equations (7) and (8) to
calculate that?"
Yes, this is the E-step.

"And if that is the case, then we estimate the f_ij's from the data, specifically on the subsequences of length W?"
Yes, this is the M-step.

"Then lambda is set initially (I think it is discussed in the
paper and earlier on this board), and we can calculate Z_ij(0)."

Yes. As part of each M-step you recompute lambda and all f_ij. Then you use these values in the next E-step. This next E-step gives you new Zij values. You use these in the next M-step, and so on.
Jerry Fu  105
11-24-2007 10:13 PM ET (US)
I'm a little confused on how we're supposed to calculate Zij(0) (equation (4)). From the paper, I didn't quite get which parameters we are supposed to get directly from the data.

For Zij(0), the part specifically that I don't get is the P(Xi|theta_j). Are we supposed to use equations (7) and (8) to calculate that? And if that is the case, then we estimate the f_ij's from the data, specifically on the subsequences of length W?

Then lambda is set initially (I think it is discussed in the paper and earlier on this board), and we can calculate Z_ij(0).
Charles Elkan  104
11-21-2007 02:10 PM ET (US)
I believe X means "unknown amino acid." There is a whole code of letters that denote various ambiguities, i.e. subsets of amino acids. The real-world MEME can deal with these codes. You can just eliminate the subsequences from this protein that contain the X.

Charles



On Wed, 21 Nov 2007, QT - Klinton Bicknell wrote:

>
< replied-to message removed by QT >
Klinton Bicknell  103
11-21-2007 12:52 PM ET (US)
The protein MAS1_AGRRA has a single X in its sequence, a character not found in any of the other proteins, and the output for MEME does not include X in its vocabulary. Is this a mistake in our data that wasn't present in the data MEME got? Or is MEME just ignoring units of vocabulary that don't occur below a certain threshold number of times?
Charles Elkan  102
11-21-2007 01:06 AM ET (US)
"After running EM to convergence, we'll have probability distributions per letter for the background model and per letter, per position for the motif model. But how can we detect what the best motif is from these
distributions?"

The best motif *is* the foreground distributions.

Here "best" means "part of the model with highest likelihood, as found by EM." For all the reasons mentioned in class, this "best" motif may not be exactly the one that biologists would consider best.

"How would we know that no motif is present?"

For each subsequence, z is the probability that it was generated by the foreground component. If z > 0.5 then you can say that this subsequence is an instance of the motif. Otherwise, say that it is an instance of the background, i.e. not of the motif.
Charles Elkan  101
11-21-2007 12:48 AM ET (US)
"Regarding the midterm: Will there be posted a solution example?"
I don't have a sample solution ready currently. Feel free to ask here about any particular question you are unsure about.
Matt Jacobsen  100
11-20-2007 09:11 PM ET (US)
After running EM to convergence, we'll have probability distributions per letter for the background model and per letter, per position for the motif model. But how can we detect what the best motif is from these distributions? We're thinking that we can subtract the background probabilities from the motif probabilities on a per position basis. Then assume the maximum probability left per position represents the motif letter in that position. But this would produce a motif every time and of length W. So that doesn't seem correct. How would we know that no motif is present?

Thanks,
Matt
Jostein Oysad  99
11-20-2007 08:45 PM ET (US)
Regarding the midterm:
Will there be posted a solution example?
Charles Elkan  98
11-19-2007 09:05 PM ET (US)
> Last year's midterm was open book and calculator allowed. Does
> the same rule apply for this year's midterm? Since we don't have
> required text books, are we allow to bring our notes?

Yes to everything.
Sean Park  97
11-19-2007 02:08 PM ET (US)
Last year's midterm was open book and calculator allowed. Does the same rule apply for this year's midterm? Since we don't have required text books, are we allow to bring our notes?

Thanks,

Good luck everyone~
Charles Elkan  96
11-19-2007 12:16 PM ET (US)
"1) How is overfitting dealt with in unsupervised algorithms. We want to figure out parameters and whatnot to our algorithm. Do we need to do this on a subset of the data rather than all of the data to avoid overfitting? (validation set vs whole set)? Normally the validation set is chosen out of the final training set to avoid biasing the final results, but here we only have one set so how does this work?"

There is no good answer to this question. Commonly, one does smoothing during training so that zero-probability in the training set does not imply zero-probability on the test set. The strength of smoothing can be determined by measuring likelihood on a test/validation set. This set may be independent, or simulated via cross-validation.


> 2) Should we worry about the Beta value in equation 13? If so, what
> should it be set to originally? In the implementation section it just
> says that Beta is user-specified (based on information available about
> the distributions of the motifs and background) and we cannot find any
> further elaboration in the paper (what information about the
> distributions?). If it is user-specified, what should we specify it as?

This beta_j is the strength of smoothing for letter number j. It is essentially the same as the pseudo-count for a naive Bayes classifier. So you can start with each beta_j = 0.1. Feel free to do experiments with different ways of setting beta_j (they do not have to be all equal).

"3) What about the Erasing factor described in footnote 1? Should we worry about this?"

You don't need to worry about this, because it is only relevant when finding two or more motifs. Your job is just to find one.
Jostein Oysad  95
11-18-2007 11:38 PM ET (US)
Couple of questions:

1) How is overfitting dealt with in unsupervised algorithms. We want to figure out parameters and whatnot to our algorithm. Do we need to do this on a subset of the data rather than all of the data to avoid overfitting? (validation set vs whole set)? Normally the validation set is chosen out of the final training set to avoid biasing the final results, but here we only have one set so how does this work?

2) Should we worry about the Beta value in equation 13? If so, what should it be set to originally? In the implementation section it just says that Beta is user-specified (based on information available about the distributions of the motifs and background) and we cannot find any further elaboration in the paper (what information about the distributions?). If it is user-specified, what should we specify it as?

3) What about the Erasing factor described in footnote 1? Should we worry about this?

Thanks for any help!
Charles Elkan  94
11-18-2007 08:55 PM ET (US)
"Is the theta parameter for a likelihood function that is fitting a multinomial is theta = (p1, p2, p3, ... pL), where there are L
possible outcomes and pi is the probability of that outcome?"

Correct.


"To maximize theta I need to be the likelihood function will
be something like this L(X;theta) = C p1^x1^p2^x2...pL^xL. C is
a term in the multinomial distribution pmf that isn't important
to maximize the likelihood because it doesn't depend on theta.

Correct.


What is the log likelihood function that is to be maximized? Is
it SUM_i pi^xi? If so then how do I maximize that function with
respect to theta?

Remember to take logs inside the product. The log likelihood is SUM_i x_i*(log p_i). You can maximize this using a Lagrange
multiplier to ensure SUM_i p_i = 1.
Matt Rodriguez  93
11-18-2007 08:11 PM ET (US)
I've been thinking about likelihood functions and fitting multinomial distributions and I want to ensure that my thinking is correct.

1.
Is the theta parameter for a likelihood function that is fitting a multinomial is theta = (p1, p2, p3, ... pL), where there are L possible outcomes and pi is the probability of that outcome?

2. To maximize theta I need to be the likelihood function will
be something like this L(X;theta) = C p1^x1^p2^x2...pL^xL. C is a term in the multinomial distribution pmf that isn't important to maximize the likelihood because it doesn't depend on theta. What is the log likelihood function that is to be maximized? Is it SUM_i pi^xi? If so then how do I maximize that function with respect to theta?

Thanks
Charles Elkan  92
11-17-2007 08:13 PM ET (US)
"In the sample output, I understand how to get Simplified
probability matrix but I don't really get the information
content graph with bits on it."

"Bits of information" means entropy. The entropy of a discrete
distribution is SUM_i p_i*log(p_i) where i ranges over the possible outcomes and p_i is the prob. of outcome i. Higher entropy means higher uncertainty means more bits of information. An ideal motif has low entropy in each column. However, real-world motifs usually have some columns that are not strongly conserved, i.e. do not have low
entropy.

Note that entropy is maximized by a uniform distribution.
Charles Elkan  91
11-17-2007 08:09 PM ET (US)
"one way to initialize a deterministic
annealing EM is to choose theta/lambda to maximize the amount of
search that EM will do by making each initial component a
perturbed version of a global model of the data. Can you
elaborate on this? I'm not quite sure what that means. Does that
mean assume 1 component (say only background or only motif) and
assume some kind of distribution for that 1 component?"

Yes. Fit a multinomial to the entire dataset by ML. This will already be approximately the correct background model. Initialize the
motif-component to have every column equal to this multinomial, but with random changes to the 20 probabilities.
Sean Park  90
11-17-2007 03:46 PM ET (US)
In the sample output, I understand how to get Simplified probability matrix but I don't really get the information content graph with bits on it. Can anyone explain what the bits are?

Thanks,
Matt Jacobsen  89
11-17-2007 02:45 PM ET (US)
My partner and I are trying to figure out a good way to produce initial parameters and are a bit stumped. In the Mixture Models notes, you suggest that one way to initialize a deterministic annealing EM is to choose theta/lambda to maximize the amount of search that EM will do by making each initial component a perturbed version of a global model of the data. Can you elaborate on this? I'm not quite sure what that means. Does that mean assume 1 component (say only background or only motif) and assume some kind of distribution for that 1 component?

Any clarification would help.

Thanks,
Matt
Charles Elkan  88
11-16-2007 05:03 PM ET (US)
"As I read the Bailey & Elkan paper, it seems that the indicator
function I(k,a) in equations 7 and 8 is really defined as 1 if a
= k, 0 otherwise. Because the arguments to it in those equations
are k and X_ij. Here k is a letter (from the summation) and so
is X_ij (the letter in the jth position from sample X_i). It
doesn't seem to make sense otherwise, right?"

Essentially you are right. I(k,X_ij) = 1 iff X_ij is the kth letter, i.e. letter number k. Earlier the paper says that the kth letter is a_k, which is not quite the same thing as saying that the kth letter is the integer k.

It probably would have been better to use the notation I(X_ij = a_k).
Matt Jacobsen  87
11-16-2007 04:52 PM ET (US)
As I read the Bailey & Elkan paper, it seems that the indicator function I(k,a) in equations 7 and 8 is really defined as 1 if a = k, 0 otherwise. Because the arguments to it in those equations are k and X_ij. Here k is a letter (from the summation) and so is X_ij (the letter in the jth position from sample X_i). It doesn't seem to make sense otherwise, right?

Thanks,
Matt
Charles Elkan  86
11-16-2007 12:55 AM ET (US)
You can find last year's midterm here:
http://www-cse.ucsd.edu/users/elkan/250B/oldmidterm.pdf

Last year I covered some topics in a different order, so last year's students had more background for some of these questions, especially the first one and the true/false statements 3, 4, 9, 10.
Kristen Jaskie  85
11-15-2007 08:51 PM ET (US)
Midterm review? You asked us to remind you.
Charles Elkan  84
11-13-2007 02:19 AM ET (US)
You are right, there are only 33 protein sequences in the adh.s dataset. Just use these. I'm not sure where I got the number 132...
Jostein Oysad  83
11-12-2007 10:38 PM ET (US)
For project 3, the dataset appears to have only 33 protein sequences. This is also the number that the MEME output describes. In the writeup, you said that our experiments should use 132 proteins. Are we misreading the data or is that a typo?
Thanks!
Charles Elkan  82
11-10-2007 01:15 AM ET (US)
Yes, the deadline for Assignment 3 is Tuesday November 27 in class. Sorry for my mistake on the handout.

The midterm is on November 20, remember.




"The assignment says that it is due Nov 20, which I think is the
Tuesday before Thanksgiving ( and the same day as the
midterm!)...should that be Nov 27?"
Jerry Fu  81
11-09-2007 02:36 AM ET (US)
The assignment says that it is due Nov 20, which I think is the Tuesday before Thanksgiving ( and the same day as the midterm!)...should that be Nov 27?
Sean Park  80
11-08-2007 09:34 PM ET (US)
Hey guys,
I'm looking for a enthusiastic person as a partner who's willing to start early and work hard. I'll work as hard, too. please email me at orgpark@gmail.com

Thank you,
Charles Elkan  79
11-07-2007 06:52 PM ET (US)
The new 250B project description is available here:
http://www.cs.ucsd.edu/users/elkan/250B/project3.html

The deadline is the Tuesday after Thanksgiving, but please do start work as soon as possible.

You will see that the project requires you to absorb a fair amount of background knowledge about protein sequence analysis. Do make this process efficient by asking clarification questions here.
Charles Elkan  78
11-05-2007 08:19 PM ET (US)
"Are we supposed to report on the performance on the 95K example
validation set provided on the web site? Or just on the test
set(s) carved out of the 95K example training set we downloaded?"
Either or both, your choice.


"I'm remembering that you said not to download the true values of the validation set, is that still the case?"

You can download the true labels of the 95K official test set now, if you like, but you can only run *one* experiment using these.
Stefano  77
11-05-2007 08:05 PM ET (US)
Are we supposed to report on the performance on the 95K example validation set provided on the web site? Or just on the test set(s) carved out of the 95K example training set we downloaded?

I'm remembering that you said not to download the true values of the validation set, is that still the case?
Charles Elkan  76
11-05-2007 07:07 PM ET (US)
"So to obtain calibrated probability, do we actually use the true label of test set?"

No, you use the true labels from a validation set in order to learn the function that maps scores into calibrated probabilities.


"Then using the calibrated prob, we calculate the profit. Am I
interpreting it correcly?"

You use the calibrated predictions on the test set to make optimal decisions for the test set. Then, you use the true labels from the test set to measure the actual profit you would have made. You do not use the true labels of the test set for training in any way.
Sean Park  75
11-05-2007 06:41 PM ET (US)
So to obtain calibrated probability, do we actually use the true label of test set? Then using the calibrated prob, we calculate the profit. Am I interpreting it correcly?
Charles Elkan  74
11-05-2007 11:59 AM ET (US)
"When you compute each score, you have to include its
denominator. Then, these scores are the single x variable in a
univariate regression. The y variable is the true label y=0 or
y=1."

"is it okay to perform regular linear regression using x and y?"

You can try standard linear regression as a first cut, but it is really not appropriate because you want to guarantee that the
predictions yhat = f(x) are in the range 0 to 1. Logistic regression guarantees this, but not linear regression.
Sean Park  73
11-05-2007 11:29 AM ET (US)
"No... When you compute each score, you have to include its denominator. Then, these scores are the single x variable in a univariate regression. The y variable is the true label y=0 or y=1."

If I understood correctly, is it okay to perform regular linear regression using x and y? and is it sort of isotonic regression?
Sean Park  72
11-05-2007 10:54 AM ET (US)
Deleted by author 11-05-2007 11:28 AM
Charles Elkan  71
11-05-2007 06:59 AM ET (US)
You wrote ""scores" are the numerators we calculate for each examples, and in our case, we want to use the value of N(C=donor) over all training examples as the y-variable in a regression, correct?"

No... When you compute each score, you have to include its denominator. Then, these scores are the single x variable in a univariate regression. The y variable is the true label y=0 or y=1.

"So then do we scale the scores so that they span from 0 to 1"
After you include the denominator, this will be guaranteed.
Jerry Fu  70
11-04-2007 10:32 PM ET (US)
Hi

We're a bit confused about turning the scores generated by the Bayesian classifier into well-calibrated probabilities. We have an idea, but we're not sure if it's correct. "scores" are the numerators we calculate for each examples, and in our case, we want to use the value of N(C=donor) over all training examples as the y-variable in a regression, correct?

So then do we scale the scores so that they span from 0 to 1, and do a regression with the feature values as our variables? The only problem is that I believe what I just described is multiple regression, and the project description says to use isotonic or logistic univariate regression. Thanks for any help on this!
Charles Elkan  69
11-04-2007 04:36 PM ET (US)
"I'm a bit confused on how to calculate confidence intervals for the total number of donors (predicted) and total amount donated (predicted)."
Obtaining a CI for the total amount is harder, so if you obtain a CI for the number of donors, that will be enough.

What you want is P(count in CI) = 0.95. You can assume that the CI is approximately [m - 2s, m+2s] where m = E[count] and s^2 = Var[count].
Expectations and variances are additive, so E[count] = SUM_i p_i
where i ranges over all test examples and p_i is the probability that test example i is a donor.

The variance of the count is Var[count] = SUM_i p_i*(1-p_i)
since the variance of a single coin flip with probability p_i is p_i*(1-p_i).
Charles Elkan  68
11-04-2007 04:30 PM ET (US)
"the results of cross validation seems to be too low, aroun 10%
precision, which is only improved twice after the algorithm. It
is too low, right?"

Not necessarily. I think it is very hard to get precision over 20%, even just on the top 1%.

"Strange thing is that when I apply different features, the outcome does not change much."

This is not surprising. None of the features are individually very predictive. Most features have no predictive value at all.
Sean Park  67
11-04-2007 02:47 PM ET (US)
Edited by author 11-04-2007 02:49 PM
After discretizing features, I applied Naive Bayes on selected features and assumed top 10% probabilities are donors. However, the results of cross validation seem to be too low, around 10% precision, which is only improved twice after the algorithm. What is the range of precision expected approximately? Strange thing is that when I apply different features, the outcome does not change much.
Thanks,
Matt Jacobsen  66
11-04-2007 02:07 PM ET (US)
I'm a bit confused on how to calculate confidence intervals for the total number of donors (predicted) and total amount donated (predicted). I know how to calculate the predictions, just not how to prove that they're bound by an interval with 95% confidence. I've been reading All of Statistics, and I believe understand the concept:
    P(p in C_n) >= 1-alpha
Here, I'd use alpha=0.05 and p is my prediction. But I'm unsure what C_n is for my distribution. It seems like I have a binomial distribution. So perhaps I can use:
    C_n=(Xh - epsilon, Xh + epsilon).
But I'm unsure how to calculate epsilon using alpha (to ensure 95%) nor what Xh would be in this case. Any suggestions would be helpful.

Thanks,
Matt
Charles Elkan  65
11-02-2007 02:07 PM ET (US)
"I'm thinking about how to provide a numerical estimate for profit that we expect to receive. To answer this question we need to know how many donors we will solicit or have the choice to solicit. How many donors should we assume that we will have the choice to solicit?"

You choose how many donors to solicit. The optinal choice for person x is to solicit iff p(x)*e(x) > 0.68 where p(x) is the estimated calibrated probability that x will donate, and e(x) is the estimated amount s/he will donate, assuming s/he does donate.


"It is possible to have a probability distribution of profit for a single donor."

You don't need a whole distribution. You need just a single best-guess dollar number. You get this from linear regression.


"I don't know how to create a probability distribution for say the number of do