QuickTopic (SM) free message boards QuickTopic (SM) free message boards
Skip to Messages
  Sign In to access your topic list  |New Topic |My Topics|Profile
Upgrade to Pro   Customize, show pictures, add an intro, and more:   QuickTopic Pro...and check out QuickThreadSM
Topic: Boosted wrapper induction
Views: 521, Unique: 328 
Subscribers: 2
What's
this?
Printer-Friendly Page
Subscribe to get & post, or stop messages by email Subscribe
All messages            1-7 of 7        
About these ads
Who | When
Messagessort recent-top   
Post a new message
 
sameer agarwal  1
05-16-2001 08:18 AM ET (US)
hi,
I would appreciate a clean description of the weak learner used.
also how do we know that the weak leaner is indeed a weak learner,
meaning, under any distribution it can produce a classifier with 50%+ accuracy ?

sameer
Joe Drish  2
05-16-2001 12:22 PM ET (US)
I found the writing style of this paper to be rather colloquial. For example, they use the word wrapper several times in the beginning of the paper, each time seemingly in a different context with a different meaning. I think they could have simplified the description of their booted wrapper technique.

I also think that the way in which they report their results, in a matter-of-fact way, is good in the sense that they try to present crisp findings, but I think they fall short of providing enough analysis as to why the results are the way they are. Perhaps this is because their method did not do significantly better than the other methods: Figure 4 in which their BWI method is merely "competitive".
Greg Hamerly  3
05-16-2001 12:26 PM ET (US)
This is an interesting paper, but it leaves a lot to imagination. The authors really need to explain their terms when they use them. For instance, in Figure 1, what are the example sets S and E? I presumed they stood for "start" and "end", but then the learner appears to learn prefixes and suffixes at the same time. I agree with sameer, the weak learner should be better described. Perhaps this paper requires a more complete knowledge of wrapper induction.

Additionally I would like to see results comparing BWI to standard WI. This is sort of shown in Figure 4, if we take early boosting rounds to indicate real WI performance. However, I doubt this is the case, because the performance is abysmal -- it doesn't indicate that WI is a useful method (for zero boosting rounds, that is).

- greg
Kristin Branson  4
05-16-2001 12:55 PM ET (US)
I'm a bit confused about the intuition behind this algorithm, as well as the details of the algorithm. I am mostly confused about why both the endpoints and the length of the field are necessary. It seems to me that only two of these are necessary, and that the third should be very dependent on the other two. This algorithm would make more sense to me if, instead of finding the fore detectors independently of the aft detectors and then using the distribution of the lengths of fields to pair them together, maybe just the fore detectors and the lengths were calculated, or just the fore and aft detectors were calculated together.

Perhaps my confusion with the intuition behind this algorithm stems from my confusion with the actual LearnDetector Weak Learner. As I understand it, the LearnnDetector goes through all possible boundaries. For each boundary, it determines the "best" length before the boundary to match a field and the "best" length after the boundary to match a field, for that boundary (I am unclear about this as BestPreExt and BestSufExt are not described). It then compares this to the previous best matches, and saves the current one if it is "better" than the previous best matches. If this is the case, it seems like the algorithm is finding the start and end of a field at the same time, instead of just unconnected fore and aft boundaries. Why doesn't the LearnDetector algorithm just return the best of p and s that it finds?
Gyozo Gidofalvi  5
05-16-2001 12:59 PM ET (US)
I agree with the comments made about the need for a more detailed description of the week learner. Furthermore, i agree with Joe's comment about the lack of analysis of the results. For example BWI was observed to be highly biased towards precision; the short explaination of the authors: "because the learned contextual patterns are highly accurate" was not convincing to me. For several tasks BWI achieved both high precision and recall; what are the characteristics of these tasks compared to other tasks where the recall was significantly lower?

I found the usage of the "lexical" setting for the wildcards appropritate to achieve better performance, but i found the comparisons with other methods not having this information unfair.

What i really liked in the paper was the precise and clear formulation of both the IE task as a classification problem and the definition of a wrapper.

Overall, i found the paper interesting, but i would not consider the approach to be novel, but rather a good usage of boosting.
Melanie Dumas  6
05-16-2001 01:14 PM ET (US)
I completely agree with the previous postings regarding not enough detail on the weak learner. Higher level intuitions for why it works would be very helpful.

I thought the experimental results were well defined and gave clear intentions. I would critique the different evaluation measurements, however. Its interesting that look-ahead is not evaluated by precision, but by CPU time. Overall, I thought the paper presented a pretty good idea of the boosting algorithm.
Dave Kauchak  7
05-16-2001 05:04 PM ET (US)
Edited by author 05-16-2001 05:05 PM
Kristen, I've thought about a few of your comments and also other people, and here are some things/examples that may help clarify the intuitions behind the algorithm.

1. A starting point and a length in many domains does not give us enough information, particularly when there are a wide variety of lengths. Consider the task of extracting web page addresses. A good "fore" detector would be the one described in the presentation and paper which is something starting with "<a href =". Now that we have identified the beginning of the address, we cannot just arbitrarily pick a length. We could, but the results would be poor. Another example is the task of speaker name extraction. Once we have identified the beginning of a speakers name we could just take the next two tokens. But, in some cases the speaker may only use a first name or the speaker might include a middle initial. For this reason we want to be able to identify the end also.

2. One might think, then, that we can just ignore the field length and just identify the beginning and end. But, consider again the speaker example. Say that we match the beginning of the speaker field and then 100 tokens later, we match the end of the field. Likely the speaker's name is not 100 tokens long. The histogram of lengths would tell us this.

3. The reason that the paper uses prefix/suffix and fore/aft is to avoid some of the confusion that many of you seem to be having. The fore and aft relate to boundary detectors. A boundary detector just matches a specific boundary. The fore detectors will identify the boundary that starts a field and the aft will identify the boundary that ends a field. The boundary detectors themselves are also built out of two parts, the prefix and suffix. For a boundary detector to match (for example, to identify the boundary at the beginning of a field to be extracted) the prefix pattern must match the tokens before the boundary and the suffix tokens must match the tokens after. A good example of this is the boundary detector for a web URL. The prefix part is "<a href="" and the suffix part is "http". These to things combined, make up the single fore detector. There would be an appropriate aft detector containing two parts also. It's a little bit weird to get used to, but the suffix part of the fore detector and the prefix part of the aft detector are part of the extracted field (as in http).

4. I agree with many of you in saying that some of the details are left out. S and E were confusing and even the notion of what a token is cause confusion in my mind. I'm still not totally sure what BestPreExt and BestSufExt do right now (but if anyone is real curious I can dive in to the details or ask Dayne himself). I think one of the reasons that these details were left out was due to page limit restrictions, but I'm not positive.

Sorry for the long message. I hope this has helped clarify things. Feel free to ask me any more questions. It took me a while before I fully understood all of the details and intricacies, but once I did, I found it to be an intriguing idea/paper.

Dave
RSS link What's this?
All messages            1-7 of 7        
QuickTopicSM message boards
Over 200,000 topics served
Learn more Frequently asked questions  Acknowledgements
What they're saying about QuickTopic
 Questions, comments, or suggestions? Contact Us
Read our use policy before beginning. We value your privacy; please read our privacy statement.
Copyright ©1999-2008 Internicity Inc. All rights reserved.