Topic: Boosted wrapper induction
Dave Kauchak  7
05-16-2001 05:04 PM ET (US)
Edited by author 05-16-2001 05:05 PM
Kristen, I've thought about a few of your comments and other people's, and here are some things/examples that may help clarify the intuitions behind the algorithm.

1. In many domains a starting point and a length do not give us enough information, particularly when field lengths vary widely. Consider the task of extracting web page addresses. A good "fore" detector would be the one described in the presentation and paper, which matches something starting with "<a href =". Once we have identified the beginning of the address, we cannot just pick an arbitrary length. We could, but the results would be poor. Another example is the task of speaker name extraction. Once we have identified the beginning of a speaker's name, we could just take the next two tokens. But in some cases the speaker may use only a first name, or the speaker might include a middle initial. For this reason we also want to be able to identify the end.

2. One might think, then, that we can ignore the field length and just identify the beginning and end. But consider the speaker example again. Say that we match the beginning of the speaker field and then, 100 tokens later, we match the end of the field. The speaker's name is very unlikely to be 100 tokens long, and the histogram of field lengths seen in training would tell us this.
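To make point 2 concrete, here is a minimal sketch of how a length histogram could rule out implausible begin/end pairs. The function names, the training lengths, and the boundary positions are all made up for illustration, and the scoring only looks at length; it is not the paper's exact scoring rule.

```python
from collections import Counter

def length_histogram(training_lengths):
    """Estimate P(field length) from the lengths of the training fields."""
    counts = Counter(training_lengths)
    total = sum(counts.values())
    return {length: n / total for length, n in counts.items()}

def pair_boundaries(fores, afts, hist):
    """Pair each begin boundary with each end boundary, keeping only
    pairs whose length was actually observed in training."""
    candidates = []
    for i in fores:
        for j in afts:
            p = hist.get(j - i, 0.0)
            if j > i and p > 0.0:
                candidates.append((i, j, p))
    return candidates

# Hypothetical speaker-name lengths (in tokens) from training data.
hist = length_histogram([2, 2, 2, 3, 1, 2, 3])

# One begin match at token 10; end matches at tokens 12 and 110.
spans = pair_boundaries([10], [12, 110], hist)
```

Here the pair (10, 110) is dropped because no training field was 100 tokens long, while (10, 12) survives with the histogram weight for length 2.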

3. The reason that the paper uses prefix/suffix and fore/aft is to avoid some of the confusion that many of you seem to be having. Fore and aft relate to boundary detectors. A boundary detector just matches a specific boundary. The fore detectors identify the boundary that starts a field and the aft detectors identify the boundary that ends a field. The boundary detectors themselves are also built out of two parts, the prefix and the suffix. For a boundary detector to match (for example, to identify the boundary at the beginning of a field to be extracted), the prefix pattern must match the tokens before the boundary and the suffix pattern must match the tokens after it. A good example of this is the boundary detector for a web URL. The prefix part is "<a href="" and the suffix part is "http". These two things combined make up the single fore detector. There would be an appropriate aft detector, also containing two parts. It's a little bit weird to get used to, but the suffix part of the fore detector and the prefix part of the aft detector are part of the extracted field (as in http).
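The prefix/suffix idea in point 3 can be sketched as below. This is only an illustration under my own assumptions: the tokenization, the example URL, the aft detector's patterns, and the helper names are hypothetical, and the patterns are plain literal token sequences rather than whatever richer patterns the paper allows.

```python
def matches_at(tokens, boundary, prefix, suffix):
    """A boundary at index i sits between tokens[i-1] and tokens[i].
    The detector matches if prefix ends exactly at the boundary and
    suffix starts exactly at it."""
    start = boundary - len(prefix)
    end = boundary + len(suffix)
    if start < 0 or end > len(tokens):
        return False
    return tokens[start:boundary] == prefix and tokens[boundary:end] == suffix

def find_boundaries(tokens, prefix, suffix):
    """Return every boundary index where the detector matches."""
    return [i for i in range(len(tokens) + 1)
            if matches_at(tokens, i, prefix, suffix)]

# Assumed tokenization of: <a href="http://example.org">
tokens = ["<a", "href", "=", '"', "http", ":", "//",
          "example", ".", "org", '"', ">"]

# Fore detector: prefix is outside the field, suffix ("http") is inside it.
fore = find_boundaries(tokens, ["<a", "href", "=", '"'], ["http"])

# Aft detector: prefix ("org") is inside the field, suffix is outside it.
aft = find_boundaries(tokens, ["org"], ['"', ">"])

# The extracted field is everything between the two boundaries.
field = "".join(tokens[fore[0]:aft[0]])
```

Note how "http" and "org" end up inside the extracted field even though they belong to the detectors, which is the slightly odd part mentioned above.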

4. I agree with many of you in saying that some of the details are left out. S and E were confusing, and even the notion of what a token is caused confusion in my mind. I'm still not totally sure what BestPreExt and BestSufExt do right now (but if anyone is really curious I can dive into the details or ask Dayne himself). I think one of the reasons that these details were left out was page limit restrictions, but I'm not positive.

Sorry for the long message. I hope this has helped clarify things. Feel free to ask me any more questions. It took me a while before I fully understood all of the details and intricacies, but once I did, I found it to be an intriguing idea/paper.

Dave