| Who | When |
Messages | |
|
|
|
| Dana Dahlstrom
|
9
|
 |
|
05-07-2002 06:21 PM ET (US)
|
|
Quibble for the day: The features in table 3 seem uncannily apt
to the task at hand; I wonder if any sensible approach could fail
given such salient information. I also wonder if any task besides
parsing text-formatted USENET FAQs could be accomplished using
these features.
|
Kristin Branson
|
8
|
 |
|
05-07-2002 05:59 PM ET (US)
|
|
Edited by author 05-07-2002 06:00 PM
I think that this paper presented an interesting and well-founded alternative to the standard HMM. It allows heuristics to be added to the standard model. I think that the vocabulary feature used in HMMs can also be used in MEMMs, with each feature being a word. MEMMs may not work as well with these features, though, because of the huge number of probabilities to learn in MEMMs versus HMMs (in HMMs, you must learn |S|*(|S| + |O|) probabilities, whereas in MEMMs you must learn |S|*|S|*|O| probabilities). I think this is also why MaxEn must be used -- there is not enough training data to accurately estimate the transition probabilities, so some assumption about the distribution of the probabilities must be made. This is my guess, anyway. Clarification on this major point would have improved the paper.
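The counts quoted above can be checked with a few lines of Python. This is only a sketch: the sizes (4 states, 1000 observation symbols, 20 binary features) are hypothetical, normalization constraints are ignored as in the post, and the third function assumes one weight per (previous state, feature, next state) triple, which is an assumption about how a feature-based model would be parameterized rather than a figure taken from the paper.

```python
def hmm_param_count(num_states: int, num_observations: int) -> int:
    """Table entries in a basic HMM: |S|*|S| transitions plus |S|*|O| emissions."""
    return num_states * (num_states + num_observations)


def memm_table_param_count(num_states: int, num_observations: int) -> int:
    """Entries needed if P(s' | s, o) were stored as a full table: |S|*|S|*|O|."""
    return num_states * num_states * num_observations


def memm_feature_param_count(num_states: int, num_features: int) -> int:
    """Assumed count for a feature-based model: one weight per
    (previous state, binary feature, next state) triple."""
    return num_states * num_features * num_states


# Hypothetical sizes, chosen only for illustration.
S, O, F = 4, 1000, 20
print(hmm_param_count(S, O))           # 4 * (4 + 1000) = 4016
print(memm_table_param_count(S, O))    # 4 * 4 * 1000 = 16000
print(memm_feature_param_count(S, F))  # 4 * 20 * 4 = 320
```

Under these assumed sizes, the full-table count grows with the product |S|*|S|*|O|, while a model driven by a small set of binary features needs far fewer weights, which may be part of why the paper does not estimate the table directly.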
HMMs cannot be used to model dependent observations because, as the introduction says, HMMs are not parameterized by the observations. This is because we are keeping track of P(o | s) instead of P(s | o).
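To make the P(o | s) versus P(s | o) contrast concrete, the two factorizations can be written side by side. This is a sketch in generic notation, not a transcription of the paper's equations: s_0 is an assumed fixed start state, and the feature functions f_a and weights \lambda_a follow the usual exponential-model form, with a separate weight set assumed for each previous state.

```latex
% HMM: generative, models the joint probability of states and observations
P(s_{1:T}, o_{1:T}) = \prod_{t=1}^{T} P(s_t \mid s_{t-1}) \, P(o_t \mid s_t)

% MEMM: conditional, models the states given the observations
P(s_{1:T} \mid o_{1:T}) = \prod_{t=1}^{T} P(s_t \mid s_{t-1}, o_t),
\qquad
P(s_t \mid s_{t-1}, o_t) = \frac{1}{Z(o_t, s_{t-1})}
    \exp\Big( \sum_a \lambda_a \, f_a(o_t, s_t) \Big)
```

In the first form the observation appears only as something the state emits; in the second it conditions every transition, so overlapping or dependent features of o_t can be used without modelling how o_t itself is generated.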
My main complaint with this paper is that it presents the use of MaxEn as a good thing. Requiring MaxEn to train this model means that a big assumption about the data must hold. I don't like that Maximum Entropy is in the name of the model, as this attribute of the model is not what distinguishes it from the standard HMM. I am looking forward to today's presentation, as my understanding of Maximum Entropy is not very good, and I think an explanation of MaxEn would help me understand the paper.
|
| Degui Zhi
|
7
|
 |
|
05-07-2002 05:44 PM ET (US)
|
|
In section 2.3, between equations (3) and (4), the authors mention that "the maximum entropy estimation is guaranteed to be ... (b) the same as the maximum likelihood solution..."
So at this point, the MEMM is not different from an HMM. Given that an HMM can also use exponential distributions, it seems that the MEMM is new only in the "conditional" part.
Anyway, this is a practical work. I am not criticizing its lack of theoretical construction, but its failure to mention the theoretical background.
|
| Degui Zhi
|
6
|
 |
|
05-07-2002 05:30 PM ET (US)
|
|
The very basic HMM is based on the assumptions of a first-order Markov chain and multinomial emission and transition probabilities, and is trained using the standard Baum-Welch algorithm to maximize joint probabilities. However, there are many variations on the Markovian assumption (higher-order Markov chains), on the emission/transition probabilities (exponential distributions or even neural-network (simulated) distributions), and on the network structure (pair-wise HMM, factorial HMM http://www.cs.berkeley.edu/~murphyk/Bayes/bayes.html). Theoretical people also think about maximizing conditional probabilities. It is not fair to compare the MEMM only to the basic HMM.
|
Dave Kauchak
|
5
|
 |
|
05-07-2002 05:00 PM ET (US)
|
|
Eric: All three of these researchers have spent a fair amount of time at academic institutions. Also, Freitag and McCallum have done groundbreaking research not just in information extraction in general, but also in information extraction with HMMs. Check out Dayne Freitag's web page to see a list of papers. Given this, I can't guarantee that the other algorithms were tuned perfectly, but my intuition is that these other methods were well developed and thought out (particularly the HMM methods).
|
| Eric Wiewiora
|
4
|
 |
|
05-07-2002 03:50 PM ET (US)
|
|
I found the paper to be very explicit, given that the authors are not at universities. They explained their algorithms from both an equational and an algorithmic approach, and I think they provided enough details to replicate the algorithm. I am a little suspicious about their results. They only briefly describe the competing methods. This would be fine if these methods were well known and unambiguous, but my suspicion is that they did not spend much effort making sure the competing algorithms were performing as well as they could.
|
| Yohan Kim
|
3
|
 |
|
05-07-2002 03:45 PM ET (US)
|
|
I am not clear on 'the most valuable contribution' of this paper (i.e. using state-observation transition functions rather than the separate transition and observation functions in HMMs). To put this as a question: why did the authors use the maximum entropy framework? Can HMMs not implement state-observation transition functions similar to those in the MEMM?
Greg: I wasn't clear on what you mean by 'vocabulary-based, free-structure environment.' I am going to assume that a news article satisfies these two requirements, since it certainly consists of vocabulary and has no structure such as 'question-answer' pairs. Consider the problem of extracting the purchase price from each document in a collection of articles describing corporate acquisitions. As was done using an HMM in one of the references mentioned in the paper, I think an MEMM can be trained with features such as 'a word indicating an act of buying/acquisition is present' and 'words suggesting company names are present' to solve the problem. Relevant states might be S1 = the state that produced tokens such as the purchase values of the acquisitions along with company names, and S2 = the state that produced irrelevant words such as 'that, is,...'. I am saying this with no experience of actually applying HMMs and MEMMs to real problems, so I welcome anyone spotting holes in my reasoning.
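The two-state idea above can be sketched in Python as an exponential model over binary observation features. Everything here is hypothetical and for illustration only: the state names, the word list, the two features, and the hand-set weights (a real MEMM would learn the weights from labelled data, which this sketch does not do).

```python
import math

# Hypothetical states, loosely following S1 and S2 in the post above.
STATES = ["price", "background"]

# Hypothetical vocabulary hinting at an acquisition.
BUY_WORDS = {"acquire", "acquired", "bought", "purchase", "purchased"}


def features(tokens):
    """Binary features of one observation (here, a list of tokens)."""
    return {
        "has_buy_word": any(t.lower() in BUY_WORDS for t in tokens),
        "has_dollar_amount": any(t.startswith("$") for t in tokens),
    }


# Hand-set weights, NOT trained: WEIGHTS[prev_state][(feature, next_state)].
WEIGHTS = {
    "background": {
        ("has_dollar_amount", "price"): 2.0,
        ("has_buy_word", "price"): 0.5,
    },
    "price": {
        ("has_dollar_amount", "price"): 1.0,
    },
}


def transition_distribution(prev_state, tokens, weights=WEIGHTS):
    """P(next_state | prev_state, observation) as a normalized exponential
    of the summed weights of the active features."""
    active = [name for name, on in features(tokens).items() if on]
    scores = {
        s: sum(weights[prev_state].get((name, s), 0.0) for name in active)
        for s in STATES
    }
    z = sum(math.exp(v) for v in scores.values())  # normalizer Z(o, s')
    return {s: math.exp(v) / z for s, v in scores.items()}


dist = transition_distribution("background", ["paid", "$12", "million"])
print(dist)
```

With these assumed weights, an observation containing a dollar amount shifts probability toward the "price" state, while an observation with no active features leaves the two states equally likely. Note that the features can overlap freely, which is the property the post is asking whether an HMM could replicate.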
|