Topic: Conditional Random Fields
Yohan Kim  7
05-09-2002 05:42 PM ET (US)
Some comments on equation (1):

Although many of the details are still unclear to me, I will venture to shed some light on this equation. Equation (1) looks similar to a Boltzmann factor. The terms inside the exponential correspond to something like an energy function with some parameters. Equation (1) assigns a probability to a graph whose particular configuration is determined by Y, conditioned on X. The lower the energy of a configuration, the greater the probability assigned to it.
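
To make the Boltzmann-factor reading concrete, here is a minimal sketch that enumerates every labeling of a tiny chain by brute force. The label set, the weights and the helper names (`energy`, `conditional_probs`) are made up for illustration and are not taken from the paper; the energy is simply assumed to be the negative of the summed feature weights.

```python
import itertools
import math

LABELS = ["A", "B"]  # hypothetical label set

# Hypothetical weights: reward matching label/observation pairs
# and the transition A -> B.
W_EMIT = {("A", "a"): 2.0, ("B", "b"): 2.0}
W_TRANS = {("A", "B"): 1.0}


def energy(y, x):
    """Assumed energy: minus the sum of the weights of the active features."""
    score = 0.0
    for i, (label, obs) in enumerate(zip(y, x)):
        score += W_EMIT.get((label, obs), 0.0)
        if i > 0:
            score += W_TRANS.get((y[i - 1], label), 0.0)
    return -score


def conditional_probs(x):
    """p(y | x) proportional to exp(-energy), normalized over all labelings of x."""
    configs = list(itertools.product(LABELS, repeat=len(x)))
    boltzmann = {y: math.exp(-energy(y, x)) for y in configs}
    z = sum(boltzmann.values())  # normalizer depends on x only
    return {y: b / z for y, b in boltzmann.items()}


x = ["a", "b"]
probs = conditional_probs(x)
best = max(probs, key=probs.get)
lowest_energy = min(probs, key=lambda y: energy(y, x))
```

With these made-up weights the labeling with the lowest energy is also the most probable one, which is the point of the analogy. Real implementations would compute the normalizer with dynamic programming rather than enumeration.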
Degui Zhi  6
05-09-2002 05:17 PM ET (US)
Some comments on experiments:
I think the idea of testing models using HMM simulation is nice. (Maybe it is routine in this field; however, the authors tactically didn't do this in the last paper, since they were comparing MEMM vs. HMM.)

In section 5.1, I am surprised to see that the MEMM can have an error rate of 42%. According to its structure, the MEMM can only learn the frequency of each word as transition probabilities from state 0. How can it do better than a random guess? There must be a correlation between the word frequencies of the training set and the test set.
Also in section 5.1, the CRF has an error rate of 4.6%. It seems this result is close to the Bayes error, O(1/32).

Why can the CRF be better? Intuitively, I think the CRF removes the direction arrows in the graph so that belief messages can be transmitted in both directions during inference. Seeing an "i" as the second symbol, the model may be able to propagate this message backward to "reconsider" the decision made when seeing the previous symbol. Anyway, this is only intuition. I fail to understand many of the derivations, so I cannot check the mathematical proof. I hope I can follow the derivation during the talk.

Generally, I think this paper is better than the previous one in both the theory and experiment sections, though I still share the same worry as Dave: "What exactly is it about generative (or hidden) models that makes it difficult to deal with non-independent features?"
Dana Dahlstrom  5
05-09-2002 05:14 PM ET (US)
Can anyone make it clear why the "label bias problem" is a problem? The authors' explanation is lost on me: "the transitions leaving a given state compete only against each other [sic], rather than against all other transitions in the model." Why is this a problem? In a given state, only transitions leaving that state can be taken, right?

I hoped the example in figure 1 would help, but it just seems a useless model for distinguishing between "rib" and "rob": it forces a commitment to one or the other on the first letter, when the two words simply can't be disambiguated. Why use such a model? What is the point here? What is this "score mass" they invoke? (Anyone read Bottou 1991?)
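
One possible way to see the issue is the following minimal sketch. It assumes a figure-1-style model with two branches out of state 0 (0-1-2-3 for "rib", 0-4-5-3 for "rob"); the state numbering is my reading of the figure, and the scores, the small bias toward the "rib" branch, and the function names are invented for illustration. With per-state normalization, a state with a single outgoing transition must pass on all of its probability whatever it observes, so the second letter cannot change the decision made on the first. With one normalizer over whole paths, the mismatching letter lowers the score of the wrong path.

```python
import math

# Assumed topology: two branches that only differ in the middle letter.
SUCCESSORS = {0: [1, 4], 1: [2], 2: [3], 4: [5], 5: [3]}
PATHS = {"rib": [0, 1, 2, 3], "rob": [0, 4, 5, 3]}
EXPECTED = {1: "r", 4: "r", 2: "i", 5: "o", 3: "b"}  # letter read on entering a state


def score(nxt, obs):
    """Hypothetical log-score for entering state nxt while reading obs."""
    match = 2.0 if EXPECTED[nxt] == obs else -2.0
    bias = 0.5 if nxt == 1 else 0.0  # pretend "rib" was a bit more frequent in training
    return match + bias


def memm_path_prob(path, word):
    """Per-state normalization: each step is a distribution over successors only."""
    p = 1.0
    for prev, nxt, obs in zip(path, path[1:], word):
        z = sum(math.exp(score(s, obs)) for s in SUCCESSORS[prev])
        p *= math.exp(score(nxt, obs)) / z
    return p


def crf_path_prob(path, word):
    """Global normalization: one normalizer over all complete paths."""
    def unnormalized(p):
        return math.exp(sum(score(n, o) for n, o in zip(p[1:], word)))
    z = sum(unnormalized(p) for p in PATHS.values())
    return unnormalized(path) / z


memm_rib_on_rob = memm_path_prob(PATHS["rib"], "rob")
memm_rob_on_rob = memm_path_prob(PATHS["rob"], "rob")
memm_rib_on_rib = memm_path_prob(PATHS["rib"], "rib")
crf_rob_on_rob = crf_path_prob(PATHS["rob"], "rob")
```

Under these assumptions the per-state model gives the "rib" path the same probability whether the input is "rib" or "rob", and so prefers it even for "rob", while the globally normalized model puts most of its probability on the "rob" path for the input "rob".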
Aldebaro  4
05-09-2002 05:07 PM ET (US)
Edited by author 05-09-2002 05:11 PM
Sharing my confused thoughts on dealing with non-independent features mentioned by Dave:

In the first paper we discussed, the authors compare MEMM to a discrete HMM where the observations in each state are modeled by a multinomial function. This is similar to the assumption of independence made by naive Bayes. I can see why this can work well when the observations are words and is problematic for features like f1 = "begins-with-number" and f2 = "begins-with-ordinal". Say that for a given state (e.g. head in the FAQ problem) the probability of f1 is p(f1)=0.1 and p(f2)=0.4. Because f1 and f2 are mutually exclusive, for a line starting with a number, one would get 0.1 * 0.6 and for ordinal 0.9 * 0.4. A multinomial function does not seem a good model in this case.
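
The arithmetic above can be reproduced with a short sketch. It assumes, as my reading of the example, that each feature is treated as an independent on/off event within the state; the probabilities are the ones quoted above and the names `P_ON` and `independent_prob` are just for illustration.

```python
import itertools

# Per-feature "on" probabilities for one state, as quoted in the post.
P_ON = {"f1": 0.1, "f2": 0.4}


def independent_prob(config):
    """Probability of an on/off configuration if the features are independent."""
    p = 1.0
    for name, on in config.items():
        p *= P_ON[name] if on else 1.0 - P_ON[name]
    return p


# Table over (f1, f2) for all four on/off combinations.
table = {
    bits: independent_prob(dict(zip(P_ON, bits)))
    for bits in itertools.product([True, False], repeat=len(P_ON))
}

number_line = table[(True, False)]   # 0.1 * 0.6
ordinal_line = table[(False, True)]  # 0.9 * 0.4
both_on = table[(True, True)]        # mass given to a combination that cannot occur
```

Since f1 and f2 are mutually exclusive, the probability the independent model gives to both being on is assigned to an event that never happens.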

Given that I believe that MEMM and CRF can deal with such "overlapping" features, what confuses me is that there are also alternatives in the HMM framework. I am trying to find what is wrong with the following reasoning:
- For the FAQ problem, I don't want to use words as observations, but the sentence-dependent "overlapping" features. So, for each sentence I extract the features in Table 3 (first paper).
- Instead of assuming a multinomial, for each state I use a histogram (then I may have to face the curse of dimensionality.)
- Instead of having the observations associated with states (Moore-HMM), I could have them associated with transitions (Mealy-HMM).
- In the end I would have a standard Mealy discrete HMM that I could train with Baum-Welch and would be able to use the "overlapping" features, which according to the results of the second paper is the main reason for improved performance.

If the features could depend on the unknown (hidden) states, then I can see that a conventional HMM cannot be used and suspect that MEMM and CRF can. But the features discussed in the two papers don't seem to depend on the states. In other words, assuming the 4 states are "head", "question", etc., if one feature is "is the previous line in the state 'head'?", then I don't see how to use a conventional HMM given that the information about the state of the previous line is unknown during training.
Degui Zhi  3
05-09-2002 04:05 PM ET (US)
Hi, class:

There is a talk by one of the fathers of HMM:

Today at 1:30, CMRR Auditorium

Dr. Larry Rabiner, digital signal processing pioneer,
author of four major books in the area, and recent
head of AT&T research, will share his views on
communication technology in the 21st century.
Dave Kauchak  2
05-09-2002 03:33 PM ET (US)
What exactly is it about generative models that makes it difficult to deal with non-independent features? This concept has been brought up in this paper and the last and still isn't quite clear to me.

I agree with Aldebaro and found it a bit interesting that convergence was a problem. I would have liked to see a better explanation of why CRFs converge so much more slowly than MEMMs.

I thought the paper did a good job of combining both theoretical and experimental information in such a short paper. In particular, I thought the combination of a couple of lesion studies as well as a real world experiment made for a nice combination in the experimental section.
Aldebaro  1
05-09-2002 04:03 AM ET (US)
Does anyone have an intuitive explanation for Eq. (1) (obtained by the fundamental theorem of random fields)?

While reading the theoretical part I got the impression that convergence would not be a problem for CRFs, given that the authors mention that a) the loss is convex in the case of fully observable states, b) Eq. (2) can be easily computed, and c) a single iteration of their algorithms is equivalent to Baum-Welch (mentioned after Eq. (2), where I assume they meant one iteration of Baum-Welch). But, to my surprise, the experimental results showed that convergence is a big issue.