| Who | When |
Messages | |
|
|
|
| John
|
149
|
 |
|
06-03-2007 04:07 PM ET (US)
|
|
Matt, Stephen was saying that even if you split up the emails into chunks of 20 emails each, that doesn't help at all, since different ways of splitting up the emails can still yield different results. We need a standardized way of splitting them so that everyone gets the same results.
|
mhtong
|
150
|
 |
|
06-03-2007 08:46 PM ET (US)
|
|
Edited by author 06-03-2007 08:54 PM
So, as I said in the assignment, the files are in mbox format, the most commonly used format. So I'm a bit surprised that there's any problem getting them chopped up into separate emails. If you look up the format, there's a specified way to break them up: each message starts at a line beginning with "From " and ends with an appended blank line. This is the official specification. I don't actually get the counts I'd expect (I think I get 290 and 598 with my quick check), but I did say they were mbox files, so you should treat them as such.
So, for instance, this is from the second hit Google provides:
A reader scans through an mbox file looking for From_ lines. Any From_ line marks the beginning of a message. The reader should not attempt to take advantage of the fact that every From_ line (past the beginning of the file) is preceded by a blank line.
Once the reader finds a message, it extracts a (possibly corrupted) envelope sender and delivery date out of the From_ line. It then reads until the next From_ line or end of file, whichever comes first. It strips off the final blank line and deletes the quoting of >From_ lines and >>From_ lines and so on. The result is an RFC 822 message.
Aside from the first msg, the following RegExp should work: "\n\nFrom .*\n". (That's from the next Google hit). So it seems like the answer to your question was out there with a pretty minimal amount of digging.....
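For anyone who wants to see the quoted reader rules as code, here is a minimal sketch in Python. It is only an illustration under assumptions: the names split_mbox, _finish and SAMPLE are made up for this post, the sample messages are invented, and it does not parse the envelope sender or date out of the From_ line.

```python
import re

# Matches a quoted From_ line such as ">From " or ">>From " at line start.
_QUOTED_FROM = re.compile(r"^>(>*From )")


def _finish(lines):
    """Strip the final blank line and undo one level of >From quoting."""
    if lines and lines[-1].strip() == "":
        lines = lines[:-1]
    return "".join(_QUOTED_FROM.sub(r"\1", line) for line in lines)


def split_mbox(text):
    """Split mbox text into messages; any line starting with 'From ' begins one."""
    messages = []
    current = None
    for line in text.splitlines(keepends=True):
        if line.startswith("From "):
            if current is not None:
                messages.append(current)
            current = []  # the From_ line itself is envelope data, not message text
            continue
        if current is not None:
            current.append(line)
    if current is not None:
        messages.append(current)
    return [_finish(m) for m in messages]


# Invented two-message sample; the second body line of message one was quoted.
SAMPLE = (
    "From alice@example.com Sun Jun  3 16:07:00 2007\n"
    "Subject: first\n"
    "\n"
    "Hello.\n"
    ">From here on, this line was quoted.\n"
    "\n"
    "From bob@example.com Sun Jun  3 20:46:00 2007\n"
    "Subject: second\n"
    "\n"
    "Bye.\n"
    "\n"
)

msgs = split_mbox(SAMPLE)
print(len(msgs))
```

Note that this sketch follows the quoted description and does not rely on a blank line before each From_ line, so it can count differently from a regex that does.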
|
| Steffan McMurrin
|
151
|
 |
|
06-04-2007 04:18 AM ET (US)
|
|
How long are people's learners taking? This seems really slow...
-bash-2.05b$ time java spamNBLearner > spamParams.txt
real 3m20.461s
user 3m22.930s
sys 0m2.360s
Almost all of the time is spent reading the email files in.
|
| mhtong
|
152
|
 |
|
06-04-2007 05:15 PM ET (US)
|
|
Slogging through the questions built up about the HW (from here, discussion, and email):
10.1 Question from Erik - You should probably model your answers on the effect axioms described in section 10.3. These use both Poss (to check preconditions) and Result (to refer to the appropriate resulting state).
10.4 Question from Josh & discussion - A lot of this I answered, but I missed the mereological part. Basically, mereology is a theory of part-whole relations that offers an alternative to set theory, built on partof relations rather than set membership. Sometimes this makes sense: for instance, since set theory assumes atomic elements, elemOf(x, Water) ^ partof(y, x) => elemOf(y, Water) seems very odd, since x and y are atomic elements of the set of things that are Water. On the other hand, partof(x, Water) ^ partof(y, x) => partof(y, Water) seems very natural. So for the second half, you use partof relations instead of predicates (e.g. Water(x)) or set theory (elemOf(x, Water) or "x \in Water"). Don't stress about it too much; the focus is more on the first part.
11.4 As Josh pointed out, you need a "Holding" fluent and an "At" fluent - both are mentioned, but not made explicit. It already mentions "Go" and "Push" actions, so I disagree about the need for an additional "Move" action. It is indeed a blocks-world-like world with only 3 positions.
I don't think I've gotten any questions about the Chapter 13 problems.
20.13 - "Are we supposed to be describing what happens to the weights of a still single layered perceptron like it was talking about in the first sentence?" Yes. 4 input neurons feeding into one perceptron output unit trying to compute parity. "Does four-input refer to 00, 01, 10, and 11, or are there actually 4 bits of input?" Four bits of input. "Are we basically showing why single layer perceptrons fail?" I'd say partly yes, and partly getting some hands-on practice with perceptrons.
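To get a feel for how the weights move in the 20.13 setup, here is a rough sketch of a single threshold perceptron trained on 4-bit parity. The step activation, the learning rate of 1, the zero initial weights and the names train_perceptron, PARITY and OR_DATA are all assumptions made for illustration; the exercise itself may specify something different. OR is included only as a linearly separable contrast.

```python
from itertools import product


def train_perceptron(samples, epochs=200, lr=1.0):
    """Train one threshold unit with the perceptron rule.

    Returns (weights, bias, converged), where converged means a full
    pass over the samples produced no errors.
    """
    w = [0.0] * len(samples[0][0])
    b = 0.0
    for _ in range(epochs):
        errors = 0
        for x, target in samples:
            activation = sum(wi * xi for wi, xi in zip(w, x)) + b
            output = 1 if activation > 0 else 0
            delta = target - output
            if delta != 0:
                errors += 1
                # Perceptron rule: move each weight by lr * error * input.
                w = [wi + lr * delta * xi for wi, xi in zip(w, x)]
                b += lr * delta
        if errors == 0:
            return w, b, True
    return w, b, False


INPUTS = list(product((0, 1), repeat=4))
PARITY = [(x, sum(x) % 2) for x in INPUTS]        # not linearly separable
OR_DATA = [(x, 1 if any(x) else 0) for x in INPUTS]  # linearly separable

_, _, parity_converged = train_perceptron(PARITY)
_, _, or_converged = train_perceptron(OR_DATA)
print("parity converged:", parity_converged)
print("OR converged:", or_converged)
```

Printing w and b inside the loop shows the weights cycling on parity instead of settling, which is the kind of written-out calculation the question is after.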
|
| mhtong
|
153
|
 |
|
06-04-2007 05:22 PM ET (US)
|
|
I'm guessing/hoping from the lack of continued complaints that people are feeling more comfy with mbox parsing? There actually are some toolboxes out there for mbox reading (at least one student has been using the Python toolbox). It seems like the format is reasonably well spelled out, so it doesn't seem like there should be disagreement. If it's something people are worried about, like I said, I'd be willing to break up the test set into small chunks with known #s of emails so you can either fix things (by hand if necessary) or at least not be penalized for being off by one or somesuch.
|
mhtong
|
154
|
 |
|
06-04-2007 08:30 PM ET (US)
|
|
For 13.8, they give exact numbers, so you should use them to support your argument numerically (i.e., do a calculation).
For 20.13, I'd want some written out calculation of how it would change - Gary's indicated that that sort of thing is likely to be on the test, so you should take the chance to practice.
|
mhtong
|
155
|
 |
|
06-05-2007 08:56 PM ET (US)
|
|
|
| Erik Corona
|
156
|
 |
|
06-06-2007 04:31 PM ET (US)
|
|
This is my count, can anyone else verify this?
Spam: 598 Not Spam: 293
|
| Erik Peterson
|
157
|
 |
|
06-06-2007 09:59 PM ET (US)
|
|
Got those exact same numbers in both of our parsers. My personal parser is written in Java and splits on "From ".
|
Stephen Boyd
|
158
|
 |
|
06-06-2007 10:08 PM ET (US)
|
|
Using the Python mailbox parser I got 598 spam and 291 ham, but originally I was using just a regex and I got 598 spam and 293 ham. Looks like something is wrong with the ham training set, since everyone seems to agree on the spam count.
|
| Tony
|
159
|
 |
|
06-06-2007 11:53 PM ET (US)
|
|
I also get 598 spam and 293 ham using the Python mailbox module.
Also, would it be possible to get a project extension, since the specifications and files for the 20-emails-per-file format aren't up yet? It would greatly ease the 10th week pressure and whatnot =).
|
| Tony
|
160
|
 |
|
06-07-2007 01:25 AM ET (US)
|
|
I was messing around with the mailbox module and discovered that if you use mailbox.PortableUnixMailbox() you'll get 293 for the ham count, and you'll get a ham count of 291 if you use mailbox.UnixMailbox().
I was using UnixMailbox yesterday, but switched up to using PortableUnixMailbox since it was recommended over the regular UnixMailbox by some of the sites I found.
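One plausible explanation for the 293 vs 291 gap, not verified against the course data, is that one reader accepts any line starting with "From " while the other also requires the rest of the line to look like a sender and a date. The sketch below shows how that alone changes the count. STRICT_FROM is an illustrative pattern written for this post, not the regex the mailbox module actually uses, and count_messages and SAMPLE are made-up names and data.

```python
import re

# Illustrative strict pattern (an assumption, not the module's own regex):
# "From <sender> <weekday> <month> <day> <hh:mm[:ss]> <year>"
STRICT_FROM = re.compile(
    r"^From \S+\s+\w{3}\s+\w{3}\s+\d{1,2}\s+\d{1,2}:\d{2}(:\d{2})?\s+\d{4}\s*$"
)


def count_messages(text, strict):
    """Count From_ lines, optionally requiring a well-formed envelope line."""
    count = 0
    for line in text.splitlines():
        if not line.startswith("From "):
            continue
        if strict and not STRICT_FROM.match(line):
            continue  # looks like body text that was never >From quoted
        count += 1
    return count


# Invented sample: two real messages, plus one unquoted body line
# that happens to start with "From ".
SAMPLE = (
    "From alice@example.com Sun Jun  3 16:07:00 2007\n"
    "Subject: first\n"
    "\n"
    "From now on, please reply to the list.\n"
    "\n"
    "From bob@example.com Sun Jun  3 20:46:00 2007\n"
    "Subject: second\n"
    "\n"
    "Bye.\n"
    "\n"
)

lenient_count = count_messages(SAMPLE, strict=False)
strict_count = count_messages(SAMPLE, strict=True)
print(lenient_count, strict_count)
```

Printing the lines that the lenient count accepts and the strict count rejects on the ham file would show whether the two extra messages are real emails or just body text.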
|
| Robin
|
161
|
 |
|
06-07-2007 06:55 PM ET (US)
|
|
Wondering what everyone is getting trying to classify the test sets? For us, currently:
trainSpam4 = 99.14% correct
trainHam3 = 77.13% correct
We are having a tough time classifying ham messages as ham, especially ones that are not written by a human and short messages that do not contain any of our feature words. Any suggestions?
|
| mhtong
|
162
|
 |
|
06-08-2007 12:48 AM ET (US)
|
|
Edited by author 06-08-2007 01:10 AM
If it's in the training set, it's perfectly legit to just add more feature words based on your performance... Feature selection is a bit of an art, not a science (at least at this stage - more scientific and principled approaches to feature selection are pretty active areas of research).
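One simple way to hunt for extra feature words in the training set is to rank words by how lopsided their document frequency is between spam and ham. The sketch below is just one heuristic among many: the add-one smoothing, the whitespace tokenizer, the function name rank_feature_words and the toy documents are all assumptions for illustration, not anything prescribed by the assignment.

```python
import math
from collections import Counter


def rank_feature_words(spam_docs, ham_docs, top_n=5):
    """Rank words by |log(P(word|spam) / P(word|ham))| with add-one smoothing.

    Probabilities are document frequencies: the fraction of messages
    in each class that contain the word at least once.
    """
    spam_df = Counter(w for d in spam_docs for w in set(d.lower().split()))
    ham_df = Counter(w for d in ham_docs for w in set(d.lower().split()))
    vocab = set(spam_df) | set(ham_df)

    def score(word):
        p_spam = (spam_df[word] + 1) / (len(spam_docs) + 2)
        p_ham = (ham_df[word] + 1) / (len(ham_docs) + 2)
        return math.log(p_spam / p_ham)

    # Most lopsided words first; ties broken alphabetically for stable output.
    ranked = sorted(vocab, key=lambda w: (-abs(score(w)), w))
    return [(w, score(w)) for w in ranked[:top_n]]


# Toy training data, invented for the example.
SPAM = ["free money now", "win free prize", "free offer"]
HAM = ["meeting notes attached", "lunch meeting tomorrow", "notes from lecture"]

top = rank_feature_words(SPAM, HAM, top_n=3)
for word, s in top:
    print(word, round(s, 3))
```

Positive scores lean spam and negative scores lean ham, so strongly negative words are candidates for the ham-side features that the short ham messages are missing.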
|
| Kei Shun Ma
|
163
|
 |
|
06-08-2007 10:10 PM ET (US)
|
|
Will we have any final review and lecture notes posted online?
|
| Erik
|
164
|
 |
|
06-10-2007 03:55 AM ET (US)
|
|
Are we allowed a cheat sheet? One-sided?
|