| Who | When |
Messages | |
|
|
|
Dave Kauchak
|
5
|
 |
|
04-07-2003 04:09 PM ET (US)
|
|
Edited by author 04-07-2003 04:16 PM
/m3: The classic problem of supervised learning is to learn the interesting characteristics of a training set and to generalize them well enough to identify new items that share those characteristics. When overfitting occurs, the learning method over-learns the characteristics of the training set and therefore does not generalize well to new items. In the case of document routing, the method will tend to identify only documents that are very similar to those the user submitted as interesting because, as Coleman mentioned in /m4, the model is more complex, which in this case means that more words have been used as query terms. This can also be seen as a tradeoff between precision and recall, in which overfitting favors precision at the expense of recall.
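To make the precision/recall tradeoff concrete, here is a minimal sketch. The document ids and the two query result sets are invented for illustration; they do not come from the paper.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for a set of retrieved document ids."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall

# Hypothetical ids: ten documents are truly relevant.
relevant = set(range(10))
# An overfit (narrow) query retrieves only near-duplicates of the training docs.
narrow = {0, 1, 2}
# A broader query finds more of the relevant set, plus some noise.
broad = {0, 1, 2, 3, 4, 5, 6, 7, 20, 21, 22, 23}

print(precision_recall(narrow, relevant))  # high precision, low recall
print(precision_recall(broad, relevant))   # lower precision, higher recall
```

The narrow query is always right about what it returns but misses most of the relevant documents, which is the behavior described above for an overfit routing model.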
|
Dave Kauchak
|
6
|
 |
|
04-07-2003 04:34 PM ET (US)
|
|
One of the nice components of this system is that it makes use of negatively labelled documents. I found this interesting, but also slightly cumbersome. In many situations, it would be difficult for users to select negative documents, particularly negative documents whose characteristics are similar to, yet distinguishable from, those of the positively labelled examples.
I wonder whether, given certain assumptions, the system could be altered to use positive examples and unlabelled examples. One assumption that comes to mind is that we may know a priori that the number of possible correct examples in the unlabelled text is small relative to the total number of unlabelled examples.
|
| Andrew Smith
|
7
|
 |
|
04-07-2003 06:08 PM ET (US)
|
|
I don't think it's too cumbersome. With a little interaction, one could use a two-pass method to get the two sets of documents (positively and negatively labelled) starting with only a set of positive documents.
1) Input the positive document set and an empty negative document set. This should retrieve many incorrect documents, which can be used as the negative set, and possibly a few positive documents, which should be added to the positive set.
2) Run the algorithm again with the new negative set, and the old (possibly augmented) positive set.
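The two passes could be sketched roughly as follows. This is not the system from the paper: the term-weighting scheme, the function names (`term_weights`, `retrieve`, `two_pass`), the toy corpus, and the `judge` callback standing in for the user's feedback are all assumptions made for illustration.

```python
from collections import Counter

def term_weights(positive, negative):
    """Toy weights: term count in positive docs minus term count in negative docs."""
    weights = Counter()
    for doc in positive:
        weights.update(doc.split())
    for doc in negative:
        weights.subtract(doc.split())
    return weights

def retrieve(corpus, positive, negative, k):
    """Return up to k highest-scoring corpus documents that are not yet labelled."""
    weights = term_weights(positive, negative)
    labelled = set(positive) | set(negative)

    def score(doc):
        return sum(weights[t] for t in doc.split())

    candidates = [d for d in corpus if d not in labelled]
    return sorted(candidates, key=score, reverse=True)[:k]

def two_pass(corpus, positive, judge, k=3):
    """Pass 1 uses an empty negative set; the user judges the results; pass 2 uses both sets."""
    first = retrieve(corpus, positive, [], k)
    negative = [d for d in first if not judge(d)]
    positive = list(positive) + [d for d in first if judge(d)]
    return retrieve(corpus, positive, negative, k), positive, negative

# Hypothetical corpus: the user wants the animal, not the car.
corpus = ["jaguar car engine", "jaguar car dealer", "jaguar cat prey",
          "wild cat jungle", "car engine repair"]
results, pos, neg = two_pass(corpus, ["jaguar cat"], judge=lambda d: "cat" in d.split())
print(neg)      # incorrect first-pass hits, reused as the negative set
print(results)  # second-pass ranking
```

In this toy run the first pass pulls in two car documents, which become the negative set, so the second pass ranks the remaining cat document above the remaining car document.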
|
| Dustin Boswell
|
8
|
 |
|
04-09-2003 04:35 AM ET (US)
|
|
During the presentation, someone asked why the Precision/Recall curves (fig 7.1) were so smooth. (One might expect a jagged curve since the axis is implicitly over 100 documents.)
Looking over the paper again, I just wanted to point out that he mentions that these are "interpolated" (over all queries?) curves. Given that it's some sort of "average", I initially thought the dotted curves were a standard deviation or something. They are actually "best-case" and "worst-case" curves, which assume that unjudged docs are all relevant or all non-relevant, respectively.
What is curious is that technically all the curves for the other systems are "worst-case" curves, since the standard in the field is always to assume unjudged docs are non-relevant. But he goes on to defend this by saying that since Luduan is so "very different" from other systems, it needs to consider the unjudged docs as well. (If I am understanding this all correctly...) I just thought that was interesting.
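For anyone who wants to see how the two assumptions move a curve, here is a rough sketch for a single query. It is not the paper's evaluation code: the ranking is invented, `None` marks an unjudged doc, and the total number of relevant docs is counted only within the ranked list, which is a simplification. The interpolation shown is the usual "maximum precision at any equal or higher recall" rule; whether the paper also averages over queries is not something I can confirm.

```python
def pr_points(ranking, unjudged_as):
    """(recall, precision) at each relevant rank; None judgments count as `unjudged_as`."""
    labels = [unjudged_as if j is None else j for j in ranking]
    total_relevant = sum(labels)
    points, hits = [], 0
    for rank, rel in enumerate(labels, start=1):
        if rel:
            hits += 1
            points.append((hits / total_relevant, hits / rank))
    return points

def interpolate(points):
    """Interpolated precision: max precision at any equal or higher recall level."""
    out, best = [], 0.0
    for recall, precision in reversed(points):
        best = max(best, precision)
        out.append((recall, best))
    return out[::-1]

# Hypothetical judgments for one ranked list: True, False, or None (unjudged).
ranking = [True, None, False, True, None, False]

worst = interpolate(pr_points(ranking, unjudged_as=False))
best = interpolate(pr_points(ranking, unjudged_as=True))
print(worst)
print(best)
```

The interpolation step removes the dips in the raw points, which is one reason such curves look smoother than the underlying ranked list would suggest.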
|
| |
Messages 9-17 deleted by topic administrator between 07-11-2008 02:28 AM and 02-22-2008 04:15 PM |
| Nick
|
18
|
 |
|
07-15-2008 03:52 PM ET (US)
|
|
|
|
19
|
 |
|
07-17-2008 05:11 AM ET (US)
|
|
Deleted by topic administrator 07-20-2008 02:17 AM
|
| travestia
|
20
|
 |
|
07-21-2008 01:10 AM ET (US)
|
|
|