QuickTopic (SM) free message boards QuickTopic (SM) free message boards
Skip to Messages
  Sign In to access your topic list  |New Topic |My Topics|Profile
Upgrade to Pro   Customize, show pictures, add an intro, and more:   QuickTopic Pro...and check out QuickThreadSM
Topic: CSE 250B Fall 2007
Printer-Friendly Page
All messages    << 129-144  113-128 of 161  97-112 >>
About these ads
Who | When
Messagessort recent-top    (not accepting new messages)
Matt Jacobsen  113
11-26-2007 01:26 AM ET (US)
I don't quite understand how to calculate the the likelihood from the MEME output. In my code, I calculate the log likelihood for each subsequence and sum it together. But this requires my final theta1, theta2, lambda1, lambda2, and Z_ij matrix. I don't get this from the MEME output. I only get (as far as I can see) the theta1 matrix, called the "letter-probability matrix". Am I missing something from the MEME output? Is there a better way of calculating the likelihood using the MEME output?

Using the "letter-probability matrix" as the initial theta1 (along with all my other initial parameters) and no annealing, I converge to the same results as MEME in just 2 iterations. When I calculate the sum log likelihood after this convergence, I get a value of -65223, whereas my not-so-crisp results produce a sum log likelihood of -65015 consistently. I would expect the log likelihood to be some kind of indicator of how well the Z_ij, theta, and lambda categorize the subsequence examples. So how can this be?

More alarmingly, when I start with MEME's "letter-probability matrix" as my initial theta1, but run EM with a 10,1 deterministic annealing T schedule, I get a Z_ij matrix that attributes absolutely all subsequences to the background model (i.e. all Z_i1 are 0 and Z_i2 are 1). This is even more confusing. Shouldn't the deterministic annealing approach find the global optimum (especially if it starts with a theta1 that's the result of finding the global optimum)?

Thanks,
Matt
Charles Elkan  114
11-26-2007 02:05 AM ET (US)

Using the "letter-probability matrix" as the initial theta1 (along with all my other initial parameters) and no annealing, I converge to the same results as MEME in just 2 iterations. When I calculate the sum log likelihood after this convergence, I get a value of -65223, whereas my not-so-crisp results produce a sum log likelihood of -65015 consistently.

Yes, what you are doing is the best way of finding the log-likelihood of the MEME output. It seems your model has better log-like. than the MEME output. A plausible explanation is that MEME is tuned to find crisp motifs; maximizing likelihood is not the perfect objective from a biological perspective.


More alarmingly, when I start with MEME's "letter-probability matrix" as my initial theta1, but run EM with a 10,1 deterministic annealing T schedule, I get a Z_ij matrix that attributes absolutely all subsequences to the background model (i.e. all Z_i1 are 0 and Z_i2 are 1). This is even more confusing.

Yes, this is strange. Is the log-likelihood increasing at every iteration; Is the final log-likelihood better than without DA?

It is possible for EM-style methods to converge to a model where some components are unused, like you are finding with DA. In this case you have a bad local optimum and you need to jump out of it somehow.

In general, note that none of the following are guaranteed:
DA will find the global optimum.
MEME finds the global optimum.
The global optimum is the optimum from the end-user perspective.
The 10,1 DA schedue is effective for any particular task.
Matt Jacobsen  115
11-26-2007 12:29 PM ET (US)
<i.More alarmingly, when I start with MEME's "letter-probability matrix" as my initial theta1, but run EM with a 10,1 deterministic annealing T schedule, I get a Z_ij matrix that attributes absolutely all subsequences to the background model (i.e. all Z_i1 are 0 and Z_i2 are 1). This is even more confusing.

Yes, this is strange. Is the log-likelihood increasing at every iteration; Is the final log-likelihood better than without DA?</i>

The log-likelihood is slightly worse, at the end, compared to not using DA. Using DA and MEME's theta1, value is -65223. Not using DA, but using MEME's theta1, value is -65072. Using our own initial theta1 with and without DA, our values are the same, -65015.

When starting with our initial theta1, the log-likelihood values are always increasing with every iteration. However, when starting with MEME's theta1 this is not true. During the t=10 stage, the first 6 iterations actually decrease the log-likelihood (approaching -65223). If I didn't see that our own theta1 + DA experiment had always increased log-likelihood, I'd expect this could happen because of the soft assignments. But now I'm very confused. Why would log-likelihood decrease when run using MEME's theta1?

Thanks,
Matt
Jerry Fu  116
11-26-2007 02:01 PM ET (US)
"Are we supposed to normalize the Z_i1 vector for all subsequences of length W such that the sum over each subsequence is <= 1? This is described as the z_ij matrix (small z matrix) in the paper. Are we supposed to use this z_ij matrix (small z) during the M-step instead of the Z_ij (big Z) matrix for j=1,2 and i=1..n?

In the paper this is done to prevent the EM from converging to simple motifs, but this then breaks the property making each a probability, Z_i1 + Z_i2 = 1 (unless we're supposed to re-scale Z_i2 as well?) Or is there some other way to prevent convergence to simple motifs?

I had thought we did it correctly, until I re-read this part. Could someone clarify? Thanks!"

I think we are supposed to normalize it. But one catch is that the way this is described in the paper, it only normalizes over zij that are on the same line (i.e. from the same protein). My program has all the subsequences of length W stored in a vector like you mentioned (referred to as X in the paper), but I also create the matrix z_ij as described in the paper using the vector X and the lengths of each protein in the dataset Y. The thing is, I use z_ij only to normalize...just about everything else is performed using X, so I think I could probably write a loop to normalize X directly. I think the loops would have to use the lengths though, because in the array X you can't tell when you've wrapped around to a new line of z_ij (i.e. a new protein).

As for Zi2, I though we can get away without normalizing it, because I thought it is only used only for the recalculating f0...but I just noticed it is also being used for lambda. so maybe we do need to normalize that one also.
Jerry Fu  117
11-26-2007 02:21 PM ET (US)
This was mentioned earlier in the board, but what does it mean to get Z_i1 > 0.50 for a particular protein? Does it mean that some subsequence of length W from that protein had a value > 0.5?

Also, do we need to look at value from the Z matrix? I suppose it is one way of figuring out which subsequence belong to the motif distribution, but this information is factored into theta already (except that theta only gives a probability distribution of letters and positions).

Speaking of theta, I know people have said that the values are not as extreme as MEME, but in my theta, the highest values that I am getting are about 0.2 (for any given letter in a given position)...that doesn't seem very extreme at all.

Lastly, the project description says that we should implement a streamlined version of MEME that fixes any mathematical errors. I've been busy implementing what is described, although I've left out some thing like erasing that we don't need. But does anyone have suggestions or hints on what needs to be fixed with the original implementation? The one thing I see so far changing is not using the z_ij as described in the paper, because it ends up being pretty sparse, and just working with a vector of the subsequences. This doesn't really streamline too much though, besides not needing to create an extra matrix.
Matt Rodriguez  118
11-26-2007 03:03 PM ET (US)
How many local optima can there be in terms of N number of training examples,W length of subsequence, and L number of letters? Can this be determined?

I know in a polynomial of root n there can be n -1 local optima, but this is a lot more complicated than that.

Cheers
Charles Elkan  119
11-26-2007 03:31 PM ET (US)
Edited by author 11-26-2007 03:31 PM

The log-likelihood is slightly worse, at the end, compared to not using DA. Using DA and MEME's theta1, value is -65223. Not using DA, but using MEME's theta1, value is -65072. Using our own initial theta1 with and without DA, our values are the same, -65015.

There is no guarantee that any particular DA schedule is useful for any particular task, so these results don't necessarily need an explanation.


When starting with our initial theta1, the log-likelihood values are always increasing with every iteration. However, when starting with MEME's theta1 this is not true. During the t=10 stage, the first 6 iterations actually decrease the log-likelihood (approaching -65223). If I didn't see that our own theta1 + DA experiment had always increased log-likelihood, I'd expect this could happen because of the soft assignments.

I think it is possible that the extra-soft assignments used by DA may cause the log-likelihood not to increase at every iteration. This phenomenon could show up for some uses of DA but not others.


But now I'm very confused. Why would log-likelihood decrease when run using MEME's theta1?

DA is one candidate explanation. Another may be that standard EM is only guaranteed to converge to a saddle point, not necessarily to a local maximum. Unfortunately I don't know any good readable full analysis of EM's convergence properties. This is frustrating because without a full theoretical analysis, it's hard to know whether the code might have a bug.
Charles Elkan  120
11-26-2007 03:34 PM ET (US)

Are we supposed to normalize the Z_i1 vector for all subsequences of length W such that the sum over each subsequence is <= 1? This is described as the z_ij matrix (small z matrix) in the paper.


You do not need to do this normalization. It is important in practice for DNA motifs. It is less important for protein motifs. You do not need to do it now partly because it is just a heuristic. Some recent research has provided an alternative approach to avoiding motifs that self-overlap that is more mathematically justified, but that is far outside the scope of this project.
Charles Elkan  121
11-26-2007 03:40 PM ET (US)
Edited by author 11-26-2007 03:40 PM

what does it mean to get Z_i1 > 0.50 for a particular protein?


Z values are not defined for whole proteins. They are just defined for subsequences. Z(protein i, position j) > 0.5 means over 50% chance that *this* subsequence is a motif instance.


Lastly, the project description says that we should implement a streamlined version of MEME that fixes any mathematical errors. I've been busy implementing what is described, although I've left out some thing like erasing that we don't need. But does anyone have suggestions or hints on what needs to be fixed with the original implementation?

"Streamlined" means only one of OOPS, ZOOPS, etc., no multiple motifs, no automatic width determination, etc. "fixes any mathematical errors" just means beware that some notation or other details may be wrong in the original paper, just as it may be wrong in any published paper. Our (my) understanding of EM is better now than it was in 1996 (but still imperfect!).
Charles Elkan  122
11-26-2007 03:42 PM ET (US)

How many local optima can there be in terms of N number of training examples, W length of subsequence, and L number of letters?


I'm pretty sure the answer is an exponential number. It has been proved that with a certain formalization of the task, finding the globally optimal motif is NP-hard.
Jerry Fu  123
11-27-2007 02:33 AM ET (US)
Edited by author 11-27-2007 02:41 AM
When we run our EM algorithm with deterministic annealing, we're noticing a strange (possibly) things.

1) When we tried T = 20 and 10, the process takes only a few iterations to converge (i.e. the euclidean distance between subsequent thetas (or f) array falls below a threshold of 10^-6). Usually, if we follow the 20 or 10 with 8, 6, or 5, the EM runs many iterations before converging (several hundred to 1000). While it's iterating, the E(LL) is always increasing, but the change in theta usually starts out decreasing, but then starts to increase before falling below the threshold of 10^-6. Then suddenly, it'll just fall off. After that, the remaining values of T all converge quickly (< 100 iterations).

We are not supposed to do anything to theta between T values right? we don't have to scale or multiply them by the root that we were taking it at do we.

2) the theta that we get frrom deterministic annealing has the same probability distribution at each position across all letters.

Should be be stopping iterations once the change in theta increases for a given value of T? I'm wondering if maybe we let EM run "too far."

UPDATE: I tried T = [2 1], and we seemed to get reasonable values for theta when I let that run to convergence.
Charles Elkan  124
11-27-2007 10:35 AM ET (US)

1) When we tried T = 20 and 10, the process takes only a few iterations to converge (i.e. the euclidean distance between subsequent thetas (or f) array falls below a threshold of 10^-6). Usually, if we follow the 20 or 10 with 8, 6, or 5, the EM runs many iterations before converging (several hundred to 1000). While it's iterating, the E(LL) is always increasing, but the change in theta usually starts out decreasing, but then starts to increase before falling below the threshold of 10^-6. Then suddenly, it'll just fall off. After that, the remaining values of T all converge quickly (< 100 iterations).

I am not surprised by the above. When T is very high (what this is is problem-dependent) then the LL surface is vey smooth and convergence is fast.

We are not supposed to do anything to theta between T values right? we don't have to scale or multiply them by the root that we were taking it at do we.

Correct.

2) the theta that we get frrom deterministic annealing has the same probability distribution at each position across all letters.

This may be because you started with T too high and the LL surface was flat.


Should be be stopping iterations once the change in theta increases for a given value of T? I'm wondering if maybe we let EM run "too far."

No.


UPDATE: I tried T = [2 1], and we seemed to get reasonable values for theta when I let that run to convergence.

Good!
Charles Elkan  125
11-30-2007 04:42 PM ET (US)
Lecture notes on log-linear models and conditional random fields are available now at http://www.cs.ucsd.edu/users/elkan/250B/loglinear.pdf

Questions are welcome here!
Kristen Jaskie  126
12-03-2007 02:59 PM ET (US)
Are you going to be covering the Viterbi algorithm and matrix multiplication algorithm in class? The paper is kind of confusing. In particular, we are having trouble understanding U(k,v).
Charles Elkan  127
12-03-2007 03:37 PM ET (US)

Are you going to be covering the Viterbi algorithm and matrix
multiplication algorithm in class? The paper is kind of
confusing. In particular, we are having trouble understanding
U(k,v).

Yes, these two algorithms will be the main topic tomorrow, Tuesday.
U(k,v) is an entry in a 2D matrix. k is an integer between 1 and the length of ybar. v is one of the possible values for y_k. Each value U(k,v) is computed based on the values in the previous row,
which is U(k-1,:).

Note the typo. in the first equation on page 5 of the notes:
y_k - 1 should be y_{k-1}.
Matt Rodriguez  128
12-03-2007 09:21 PM ET (US)
Quick question about the project

By placing hyphens in novel english words correctly do you mean breaking the word into its syllables?

Thanks,
Matt
RSS link What's this?
All messages    << 129-144  113-128 of 161  97-112 >>
QuickTopicSM message boards
Over 200,000 topics served
Learn more Frequently asked questions  Acknowledgements
What they're saying about QuickTopic
 Questions, comments, or suggestions? Contact Us
Read our use policy before beginning. We value your privacy; please read our privacy statement.
Copyright ©1999-2008 Internicity Inc. All rights reserved.