Aldebaro
4
05-09-2002 05:07 PM ET (US)
Edited by author 05-09-2002 05:11 PM
Sharing my confused thoughts on dealing with non-independent features mentioned by Dave:
In the first paper we discussed, the authors compare the MEMM to a discrete HMM where the observations in each state are modeled by a multinomial function. This is similar to the independence assumption made by naive Bayes. I can see why this can work well when the observations are words, and why it is problematic for features like f1 = "begins-with-number" and f2 = "begins-with-ordinal". Say that for a given state (e.g. head in the FAQ problem) the probabilities are p(f1)=0.1 and p(f2)=0.4. Because f1 and f2 are mutually exclusive, a line starting with a number has f1 on and f2 off, so one would get 0.1 * 0.6, and a line starting with an ordinal would get 0.9 * 0.4. A multinomial function does not seem to be a good model in this case.
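To make the arithmetic above concrete, here is a minimal sketch that treats f1 and f2 as independent binary features, using the two probabilities from my example. The function and variable names are my own, not from either paper. It also shows one way to see the problem: the product form puts some probability on the pattern where both features fire, which cannot happen if they are mutually exclusive.

```python
from itertools import product

# Per-feature probabilities for one state (e.g. "head"), from the example above.
p = {"f1": 0.1, "f2": 0.4}

def independent_prob(f1, f2):
    """Probability of a (f1, f2) pattern when the two binary features
    are treated as independent, as in naive Bayes."""
    a = p["f1"] if f1 else 1 - p["f1"]
    b = p["f2"] if f2 else 1 - p["f2"]
    return a * b

# Probability of each of the four on/off patterns under independence.
table = {bits: independent_prob(*bits) for bits in product([0, 1], repeat=2)}
for bits, prob in table.items():
    print(bits, round(prob, 2))

# Mass given to the pattern that never occurs if f1 and f2 are mutually exclusive.
impossible_mass = table[(1, 1)]
```

The entries for (1, 0) and (0, 1) are the two products quoted above; the entry for (1, 1) is the mass the independent model spends on an impossible line.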
Given that I believe that MEMM and CRF can deal with such "overlapping" features, what confuses me is that there are also alternatives in the HMM framework. I am trying to find what is wrong with the following reasoning:
- For the FAQ problem, I don't want to use words as observations, but the sentence-dependent "overlapping" features. So, for each sentence I extract the features in Table 3 (first paper).
- Instead of assuming a multinomial, for each state I use a histogram (then I may have to face the curse of dimensionality.)
- Instead of having the observations associated to states (Moore-HMM), I could have them associated to transitions (Mealy-HMM).
- In the end I would have a standard Mealy discrete HMM that I could train with Baum-Welch and would be able to use the "overlapping" features, which according to the results of the second paper is the main reason for improved performance.
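A rough sketch of the histogram idea, under some loud assumptions: the data is a made-up toy set with only two binary features, the states are attached to the lines by hand (so this is plain counting, not Baum-Welch), and the histogram is per state rather than per transition. The name `fit_joint_histograms` is hypothetical. The point is only that the observation symbol is the whole feature vector, so overlapping features are handled jointly.

```python
from collections import Counter, defaultdict
from itertools import product

def fit_joint_histograms(labelled_lines, n_features, alpha=0.0):
    """Count whole feature vectors per state (hypothetical helper).

    labelled_lines: iterable of (state, feature_tuple) pairs.
    alpha: additive smoothing; with alpha=0 an unseen state would
    cause a division by zero, so only query states seen in the data.
    """
    counts = defaultdict(Counter)
    for state, feats in labelled_lines:
        counts[state][tuple(feats)] += 1
    # One histogram cell per on/off pattern: doubles with every extra feature.
    n_cells = 2 ** n_features

    def emission(state, feats):
        total = sum(counts[state].values())
        return (counts[state][tuple(feats)] + alpha) / (total + alpha * n_cells)

    return emission

# Made-up toy data: (state, (f1, f2)); f1 and f2 never fire together.
toy = [
    ("head", (1, 0)), ("head", (0, 1)), ("head", (0, 1)), ("head", (0, 0)),
    ("question", (0, 0)), ("question", (1, 0)),
]
emission = fit_joint_histograms(toy, n_features=2)
head_cells = {bits: emission("head", bits) for bits in product([0, 1], repeat=2)}
```

With the joint histogram, the pattern (1, 1) gets zero probability in "head" because it never occurs. The price is the number of cells, 2 ** n_features per state, which is where the curse of dimensionality comes in.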
If the features could depend on the unknown (hidden) states, then I can see that a conventional HMM cannot be used, and I suspect that MEMM and CRF can. But the features discussed in the two papers don't seem to depend on the states. For example, assuming the 4 states are "head", "question", etc., if one feature were "is the previous line in the state 'head'?", then I don't see how to use a conventional HMM, given that the state of the previous line is unknown during training.