QuickTopic (SM) free message boards
Topic: Maximum Entropy Markov Models
All messages            1-9 of 9        
Dave Kauchak  1
05-06-2002 11:44 PM ET (US)
I found the new approach to segmentation and information extraction fairly interesting. I'm glad the authors gave some introductory material on HMMs as a bit of a refresher and for comparison purposes. However, I would have liked to see a bit more explanation of the differences (and the implications of the differences) between HMMs and MEMMs.

As with other papers that I have seen by these authors, I found some of the notation excessive. I think some of the ideas and algorithms could have been presented more clearly (though, to be fair, the authors do a good job of this in parts of the paper, just not all).

Overall, I think the paper could have benefited from a bit more length (such as a journal publication). In particular, I would have liked to see the experimental section fleshed out a bit. How well does this method perform on IE tasks? Along these same lines, I would have liked to see a better experimental comparison between methods (particularly HMMs). One last note: I think the authors should have included more information about the dataset. How many question/answer pairs are on a given page? I assume more than just a couple, since they train on a single page.
Greg Hamerly  2
05-07-2002 03:11 AM ET (US)
I liked their approach, but I read this paper 9 months ago and was totally confused. This time around it makes more sense, especially after Aldebaro's talk on discriminative vs. generative models.

My main confusion this time was about how the features (f_<b,s>) are connected with the probabilities, but equations 2 and 4 make it fairly clear. The features chosen seem to be good, if heuristically chosen. However, my biggest complaint is that they appear to be ignoring MOST word/vocabulary information! This makes the model significantly different from the HMMs I am used to. True, the features could be used for vocabulary, but it seems that it would be very hard to adapt this model to a vocabulary-based, free-structure environment. In other words, this model seems good for highly structured text (where whitespace and certain keywords play a big role), but not for unstructured text. Incorporating word frequency information seems to be very difficult.
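For readers following along, here is a minimal sketch of how binary features f_<b,s> can turn into a per-previous-state conditional distribution P_{s'}(s | o) of exponential form, which is how I read equations 2 and 4. The state names, predicates, weight values, and the layout of `weights` are all made up for illustration; they are not taken from the paper.

```python
import math

def memm_transition(prev_state, obs, states, features, weights):
    """Return P_{prev_state}(s | obs) for every s, as a dict.

    features: list of (predicate, target_state) pairs, one per f_<b,s>.
    weights:  weights[prev_state][i] is the weight (lambda) of feature i
              in the model attached to prev_state (hypothetical layout).
    """
    scores = {}
    for s in states:
        total = 0.0
        for i, (b, target) in enumerate(features):
            # f_<b,s>(o, s) is 1 only if b(o) holds and s is the feature's state
            if target == s and b(obs):
                total += weights[prev_state][i]
        scores[s] = math.exp(total)
    z = sum(scores.values())  # normalizer Z(o, s')
    return {s: v / z for s, v in scores.items()}

# Toy setup (assumed): two states and two line-level predicates.
states = ["question", "answer"]
features = [
    (lambda line: line.endswith("?"), "question"),
    (lambda line: line.startswith(" "), "answer"),
]
weights = {"question": [1.0, 2.0], "answer": [2.0, 0.5]}

probs = memm_transition("answer", "What is a MEMM?", states, features, weights)
print(probs)
```

A vocabulary feature would just be another predicate such as "the line contains word w", which is why the number of weights grows quickly once words are used as features.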
Yohan Kim  3
05-07-2002 03:45 PM ET (US)
I am not clear on 'the most valuable contribution' of this paper (i.e. using state-observation transition functions rather than the separate transition and observation functions in HMMs). To put this as a question: why did the authors use the maximum entropy framework? Can HMMs not implement state-observation transition functions similar to those in MEMMs?

Greg:
I wasn't clear on what you mean by 'vocabulary-based, free-structure environment.' I am going to assume that a news article satisfies these two requirements, since it certainly consists of vocabulary and has no structures such as 'question-answer' pairs. Consider the problem of extracting the purchase price from each document in a collection of articles describing corporate acquisitions. As was done using an HMM in one of the references mentioned in the paper, I think an MEMM can be trained with features such as 'a word indicating an act of buying/acquisition is present' and 'words suggesting company names are present' to solve the problem. Relevant states might be S1 = the state that produces tokens such as purchase values of the acquisitions along with company names, and S2 = the state that produces irrelevant words such as 'that, is,...'. I am saying this with no experience of actually applying HMMs and MEMMs to real problems, so I welcome anyone spotting holes in my reasoning.
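To make the suggestion above concrete, here is a rough sketch of what such observation predicates might look like when paired with the state S1. The word lists, the tokenizer, and the example sentence are all hypothetical; real features for this task would need far more care.

```python
# Hypothetical word lists, for illustration only.
BUY_WORDS = {"acquire", "acquired", "buy", "bought", "purchase", "purchased"}
COMPANY_SUFFIXES = {"inc", "corp", "ltd", "co"}

def tokens(text):
    # Crude whitespace tokenizer that drops surrounding punctuation.
    return [t.strip(".,;:()").lower() for t in text.split()]

def has_buy_word(text):
    return any(t in BUY_WORDS for t in tokens(text))

def has_company_name(text):
    return any(t in COMPANY_SUFFIXES for t in tokens(text))

def has_dollar_amount(text):
    return any(t.startswith("$") and any(c.isdigit() for c in t)
               for t in tokens(text))

# Each (predicate, state) pair plays the role of one f_<b,s>.
FEATURES = [
    (has_buy_word, "S1"),
    (has_company_name, "S1"),
    (has_dollar_amount, "S1"),
]

def active_features(text, state):
    """Names of the features that fire for this observation and state."""
    return [b.__name__ for b, target in FEATURES if target == state and b(text)]

sentence = "Acme Corp. acquired Widget Inc. for $12 million."
print(active_features(sentence, "S1"))
print(active_features(sentence, "S2"))
```

Whether features like these carry enough information on free text is exactly the open question in this exchange; the sketch only shows that the feature mechanism itself does not rule out word-based predicates.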
Eric Wiewiora  4
05-07-2002 03:50 PM ET (US)
I found the paper to be very explicit, given that the authors are not at universities. They explained their algorithms both with equations and as procedures, and I think they provided enough details to replicate the algorithm.
I am a little suspicious about their results. They only briefly describe the competing methods. This would be fine if these methods were well-known and unambiguous, but my suspicion is that they did not spend much effort making sure those algorithms were performing as well as they could.
Dave Kauchak  5
05-07-2002 05:00 PM ET (US)
Eric:
All three of these researchers have spent a fair amount of time at academic institutions. Also, Freitag and McCallum have done groundbreaking research not just in information extraction in general, but also in information extraction with HMMs. Check out Dayne Freitag's web page to see a list of papers. Given this, I can't guarantee that the other algorithms were tuned perfectly, but my intuition is that these other methods were well developed and thought out (particularly the HMM methods).
Degui Zhi  6
05-07-2002 05:30 PM ET (US)
The very basic HMM is based on the assumptions of a first-order Markov chain and multinomial emission and transition probabilities, and is trained using the standard Baum-Welch algorithm to maximize the joint probability.

However, there are a lot of variations on the Markovian assumptions (higher-order Markov chains), on the emission/transition probabilities (exponential distributions or even distributions simulated by neural networks), and on the network structures (pair-wise HMM, factorial HMM; see http://www.cs.berkeley.edu/~murphyk/Bayes/bayes.html). Theoretical people also think about maximizing conditional probabilities.

It is not fair to compare the MEMM only to the basic HMM.
Degui Zhi  7
05-07-2002 05:44 PM ET (US)
In section 2.3, between equations (3) and (4), the authors mention that "the maximum entropy estimation is guaranteed to be ... (b) the same as the maximum likelihood solution..."

So at this point, the MEMM is not different from an HMM. Given that HMMs can also use exponential distributions, it seems that the MEMM is only new in the "conditional" part.

Anyway, this is practical work. I am not criticizing its lack of theoretical construction, but its failure to mention the theoretical background.
Kristin Branson  8
05-07-2002 05:59 PM ET (US)
Edited by author 05-07-2002 06:00 PM
I think that this paper presented an interesting and well-founded alternative to the standard HMM. It allows heuristics to be added to the standard model. I think that the vocabulary feature used in HMMs can also be used in MEMMs, with each feature being a word. MEMMs may not work ideally with these features because of the huge number of probabilities to learn in MEMMs versus HMMs (in HMMs, you must learn |S|*(|S| + |O|) probabilities, whereas in MEMMs you must learn |S|*|S|*|O| probabilities). I think this is also why MaxEn must be used: there is not enough training data to accurately estimate the transition probabilities, so some assumption about the distribution of the probabilities must be made. This is my guess, anyway. Clarification on this major point would have improved the paper.
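As a quick arithmetic check of the counts above, here is a small sketch. The numbers plugged in (4 states, a 10,000-word vocabulary, 20 binary predicates) are made-up examples, and the third function reflects my assumption that each previous state gets its own exponential model with one weight per (predicate, next state) pair.

```python
def hmm_param_count(n_states, n_obs):
    # |S| x |S| transition table plus |S| x |O| emission table
    return n_states * (n_states + n_obs)

def memm_table_param_count(n_states, n_obs):
    # One entry of P(s | s', o) per (s', s, o) triple if stored as a full table
    return n_states * n_states * n_obs

def memm_feature_param_count(n_states, n_predicates):
    # Assumption: one weight per (previous state, predicate, next state)
    return n_states * n_predicates * n_states

n_states, n_obs, n_predicates = 4, 10_000, 20  # hypothetical sizes
print(hmm_param_count(n_states, n_obs))
print(memm_table_param_count(n_states, n_obs))
print(memm_feature_param_count(n_states, n_predicates))
```

With a handful of predicates the feature-based count stays small; it approaches the full-table count only when every vocabulary word becomes its own predicate.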

HMMs cannot be used to model dependent observations because, as the introduction says, HMMs are not parameterized by the observations. This is because we are keeping track of P(o | s) instead of P(s | o).
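The difference between tracking P(o | s) and P(s | o) shows up directly in the forward recursion. Below is a toy sketch of one forward step under each model; all probabilities and state/observation names are invented for illustration, and the dictionary layouts are my own choice.

```python
def hmm_forward_step(alpha, obs, states, trans, emit):
    # alpha_{t+1}(s) = sum_{s'} alpha_t(s') * P(s | s') * P(o | s)
    return {
        s: sum(alpha[sp] * trans[sp][s] for sp in states) * emit[s][obs]
        for s in states
    }

def memm_forward_step(alpha, obs, states, cond):
    # alpha_{t+1}(s) = sum_{s'} alpha_t(s') * P_{s'}(s | o)
    return {
        s: sum(alpha[sp] * cond[sp][obs][s] for sp in states)
        for s in states
    }

states = ["A", "B"]
alpha = {"A": 0.6, "B": 0.4}

# HMM: separate transition and emission tables (toy numbers).
trans = {"A": {"A": 0.7, "B": 0.3}, "B": {"A": 0.2, "B": 0.8}}
emit = {"A": {"x": 0.5, "y": 0.5}, "B": {"x": 0.1, "y": 0.9}}

# MEMM: one conditional distribution over next states per (s', o) pair.
cond = {
    "A": {"x": {"A": 0.9, "B": 0.1}},
    "B": {"x": {"A": 0.4, "B": 0.6}},
}

hmm_alpha = hmm_forward_step(alpha, "x", states, trans, emit)
memm_alpha = memm_forward_step(alpha, "x", states, cond)
print(hmm_alpha, memm_alpha)
```

In the MEMM step the observation only conditions the transition, so the new alpha values still sum to what they summed to before; in the HMM step the emission term shrinks the total, because alpha there is a joint probability with the observations.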

My main complaint with this paper is that it presents the use of MaxEn as a good thing. Requiring MaxEn to train this model means that a big assumption about the data must hold. I don't like that Maximum Entropy is in the name of the model, as this attribute of the model is not what distinguishes it from the standard HMM. I am looking forward to today's presentation, as my understanding of Maximum Entropy is not very good, and I think an explanation of MaxEn would help me understand the paper.
Dana Dahlstrom  9
05-07-2002 06:21 PM ET (US)
Quibble for the day: The features in table 3 seem uncannily apt to the task at hand; I wonder if any sensible approach could fail given such salient information. I also wonder if any task besides parsing text-formatted USENET FAQs could be accomplished using these features.