QuickTopic (SM) free message boards
Topic: CSE 150 Spring 2007
mhtong  148
06-03-2007 04:37 AM ET (US)
Clarification: When I said 20 email chunks, I meant files with 20 emails each, not 20 chunks of unknown amount.
John  149
06-03-2007 04:07 PM ET (US)
Matt, Stephen was saying that even if you split up the emails into chunks of 20 emails each, that doesn't help at all, since different ways of splitting up the emails can still yield different results. We need a standardized way of splitting them, so everyone can get the same results.
mhtong  150
06-03-2007 08:46 PM ET (US)
Edited by author 06-03-2007 08:54 PM
So, as I said in the assignment, the files are in mbox format, the most commonly used format. So I'm a bit surprised that there's any problem getting them chopped up into separate emails. If you look up the format, there's a specified way of how to break them up, namely lines starting with "From ", with a blank line appended at the end. This is the official specification. I don't actually get the counts I'd expect (I think I get 290 and 598 with my quick check), but I did say they were mbox files so you should treat them as such.....

So, for instance, this is from the second hit Google provides:

    A reader scans through an mbox file looking for From_ lines.
    Any From_ line marks the beginning of a message. The reader
    should not attempt to take advantage of the fact that every
    From_ line (past the beginning of the file) is preceded by a
    blank line.

    Once the reader finds a message, it extracts a (possibly
    corrupted) envelope sender and delivery date out of the
    From_ line. It then reads until the next From_ line or end
    of file, whichever comes first. It strips off the final
    blank line and deletes the quoting of >From_ lines and
    >>From_ lines and so on. The result is an RFC 822 message.

Aside from the first msg, the following RegExp should work: "\n\nFrom .*\n". (That's from the next Google hit). So it seems like the answer to your question was out there with a pretty minimal amount of digging.....
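
A minimal sketch of the reader described in the quoted passage, in Python. It treats any line beginning with "From " as the start of a message, drops the final blank line, and removes one level of ">" quoting from ">From " lines. The names split_mbox and _finish are made up for illustration and are not part of the assignment.

```python
import re

# Matches quoted From_ lines such as ">From " or ">>From " at line start.
_QUOTED_FROM = re.compile(r"^>(>*From )")


def _finish(lines):
    # Strip the single trailing blank line that separates messages.
    if lines and lines[-1] == "":
        lines = lines[:-1]
    return "\n".join(lines)


def split_mbox(text):
    """Split mbox-formatted text into a list of message strings.

    Any line starting with "From " begins a new message; the From_ line
    itself is not included in the returned message.
    """
    messages = []
    current = None  # lines of the message being read; None before the first From_ line
    for line in text.splitlines():
        if line.startswith("From "):
            if current is not None:
                messages.append(_finish(current))
            current = []
        elif current is not None:
            # Undo one level of ">" quoting on >From_ lines.
            current.append(_QUOTED_FROM.sub(r"\1", line))
    if current is not None:
        messages.append(_finish(current))
    return messages
```

Calling len(split_mbox(open(path).read())) on each training file gives a count to compare with other people's numbers. Unlike the "\n\nFrom .*\n" pattern, this version does not rely on the blank line before each From_ line.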
Steffan McMurrin  151
06-04-2007 04:18 AM ET (US)
How long are ppl's learners taking? This seems really slow...

-bash-2.05b$ time java spamNBLearner > spamParams.txt

real 3m20.461s
user 3m22.930s
sys 0m2.360s

Almost all of the time is spent reading the email files in.
mhtong  152
06-04-2007 05:15 PM ET (US)
Slogging through the questions built up about the HW (from here, discussion, and email):

10.1 Question from Erik - You should probably model your answers on the effect axioms described in section 10.3. These use both Poss (to check preconditions) and Result (to refer to the appropriate resulting state).

10.4 Question from Josh & discussion - A lot of this I answered, but I missed the mereological part. Basically, mereology is an alternative to set theory that is built on partof relations. Sometimes this makes sense: for instance, since set theory assumes atomic elements, elemOf(x, Water) ^ partof(y, x) => elemOf(y, Water) seems very odd, since x and y are atomic elements of the set of things that are Water. On the other hand, partof(x, Water) ^ partof(y, x) => partof(y, Water) seems very natural. So for the second half, you use partof relations instead of predicates (e.g. Water(x)) or set theory (elemOf(x, Water) or "x \in Water"). Don't stress it too much, the focus is more on the first part.

11.4 As Josh pointed out, you need a "Holding" fluent and an "At" fluent - both are mentioned, but not made explicit. It already mentions "Go" and "Push" actions, so I disagree about the need for an additional "Move" action. It is indeed a blocks-world-like world with only 3 positions.

I don't think I've gotten any questions about the Chapter 13 problems.

20.13 - "Are we supposed to be describing what happens to the weights of a still single layered perceptron like it was talking about in the first sentence?" Yes. 4 input neurons feeding into one perceptron output unit trying to compute parity. "Does four-input refer to 00, 01, 10, and 11, or are there actually 4 bits of input?" Four bits of input. "Are we basically showing why single layer perceptrons fail?" I'd say partly yes, and partly getting some hands-on practice with perceptrons.
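
For anyone who wants to check a hand calculation for 20.13, here is a small sketch of the perceptron learning rule applied to 4-bit parity. The learning rate, the zero initial weights, the hard threshold at 0, and the epoch count are arbitrary choices for illustration and do not come from the book or the assignment; train_parity_perceptron is a made-up name.

```python
from itertools import product


def train_parity_perceptron(epochs=50, alpha=0.1):
    """Run the perceptron learning rule on 4-bit parity.

    Returns (weights, accuracy); weights[0] is the bias weight.
    """
    # All 16 four-bit inputs; the target is 1 when the number of 1 bits is odd.
    data = [(bits, sum(bits) % 2) for bits in product((0, 1), repeat=4)]
    weights = [0.0] * 5  # bias weight + 4 input weights

    def predict(bits):
        total = weights[0] + sum(w * x for w, x in zip(weights[1:], bits))
        return 1 if total > 0 else 0

    for _ in range(epochs):
        for bits, target in data:
            error = target - predict(bits)  # -1, 0, or +1
            # Perceptron rule: w_i <- w_i + alpha * error * x_i
            weights[0] += alpha * error
            for i, x in enumerate(bits):
                weights[i + 1] += alpha * error * x

    correct = sum(predict(bits) == target for bits, target in data)
    return weights, correct / len(data)
```

Because parity is not linearly separable, the accuracy stays below 1.0 no matter how many epochs are run, and the weights keep getting pushed back and forth. Printing the weights after each update is a quick way to see the kind of change the written-out calculation should show.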
mhtong  153
06-04-2007 05:22 PM ET (US)
I'm guessing/hoping from the lack of continued complaints that people are feeling more comfy with mbox parsing? There actually are some toolboxes out there for mbox reading (one student at least has been using the Python tool box). It seems like the format is reasonably well spelled out, so it doesn't seem like there should be disagreement. If it's something people are worried about, like I said I'd be willing to break up the test set into small chunks with known #s of emails so you can either fix things (by hand if necessary) or at least not be penalized for being off by one or somesuch.
mhtong  154
06-04-2007 08:30 PM ET (US)
For 13.8, they give exact numbers so you should use them to support your argument numerically (i.e., do a calculation).

For 20.13, I'd want some written out calculation of how it would change - Gary's indicated that that sort of thing is likely to be on the test, so you should take the chance to practice.
mhtong  155
06-05-2007 08:56 PM ET (US)
I posted the slides from the last couple sections on the webpage. Last week's in particular has a lot of tips and explanation of the current project and is at: http://www.cse.ucsd.edu/classes/sp07/cse150/section/Section8.pdf.
Erik Corona  156
06-06-2007 04:31 PM ET (US)
This is my count, can anyone else verify this?

Spam: 598
Not Spam: 293
Erik Peterson  157
06-06-2007 09:59 PM ET (US)
Got those exact same numbers in both of our parsers. My personal parser was written in Java and used "From " as the delimiter.
Stephen Boyd  158
06-06-2007 10:08 PM ET (US)
Using the python mailbox parser I got 598 spam and 291 ham, but originally I was using just a regex and I got 598 spam and 293 ham. Looks like something is wrong with the ham training set, because it looks like everyone can agree on the spam.
Tony  159
06-06-2007 11:53 PM ET (US)
I also get 598 spam and 293 ham using the python mailbox module.

Also, would it be possible for a project extension since specifications and files for the 20 emails per file format aren't up yet? It would greatly ease the 10th week pressure and what not =).
Tony  160
06-07-2007 01:25 AM ET (US)
I was messing around with the mailbox module and discovered that if you use mailbox.PortableUnixMailbox() you'll get 293 for the ham count, and you would get a ham count of 291 if you used mailbox.UnixMailbox().

I was using UnixMailbox yesterday, but switched up to using PortableUnixMailbox since it was recommended over the regular UnixMailbox by some of the sites I found.
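
A likely explanation for the 291 vs. 293 gap: per the Python 2 mailbox documentation, UnixMailbox only accepts a "From " line that is followed by a well-formed sender and date, while PortableUnixMailbox accepts any line that starts with "From ". The sketch below illustrates that difference. STRICT_FROM is a simplified stand-in pattern written for this example, not the library's actual regular expression, and count_messages is a hypothetical helper.

```python
import re

# Simplified stand-in for a strict From_ line check:
# "From <sender> <weekday> <month> <day> <hh:mm[:ss]> <year>".
# An assumption for illustration, not the library's actual pattern.
STRICT_FROM = re.compile(
    r"^From \S+\s+\w{3}\s+\w{3}\s+\d{1,2}\s+\d{1,2}:\d{2}(:\d{2})?\s+\d{4}\s*$"
)


def count_messages(text, strict=False):
    """Count mbox messages by counting From_ lines.

    strict=False counts every line starting with "From ";
    strict=True counts only lines that also match STRICT_FROM.
    """
    count = 0
    for line in text.splitlines():
        if strict:
            if STRICT_FROM.match(line):
                count += 1
        elif line.startswith("From "):
            count += 1
    return count
```

Two From_ lines in the ham file that fail the strict check would be enough to produce the 293 vs. 291 split, although that has not been verified against the actual files. Printing the lines that pass the lenient test but fail the strict one would show exactly where the parsers disagree.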
Robin  161
06-07-2007 06:55 PM ET (US)
Wondering what everyone is getting trying to classify the test sets?
For us, currently:
trainSpam4 = 99.14% correct
trainHam3 = 77.13% correct

We are having a tough time classifying ham messages as ham, especially ones that are not written by a human and short messages that do not contain any of our feature words. Any suggestions?
mhtong  162
06-08-2007 12:48 AM ET (US)
Edited by author 06-08-2007 01:10 AM
If it's in the training set, it's perfectly legit to just add more feature words based on your performance... Feature selection is a bit of an art, not a science (at least at this stage - more scientific and principled approaches to feature selection are pretty active areas of research).
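
One simple way to find extra feature words in the training set is to rank words by how differently they show up in ham and spam. The sketch below scores each word by a smoothed log-odds of its per-message frequency. The helper name rank_feature_words, the whitespace tokenizer, and the add-one smoothing are all assumptions for illustration, not the assignment's method.

```python
import math
from collections import Counter


def rank_feature_words(ham_msgs, spam_msgs, top_k=10):
    """Return the top_k (word, score) pairs that best separate ham from spam.

    score = log(P(word | spam) / P(word | ham)) with add-one smoothing,
    where P(word | class) is the fraction of messages containing the word.
    Words are ranked by the absolute value of the score.
    """
    def doc_freq(msgs):
        counts = Counter()
        for msg in msgs:
            counts.update(set(msg.lower().split()))  # count each word once per message
        return counts

    ham_df, spam_df = doc_freq(ham_msgs), doc_freq(spam_msgs)
    scores = {}
    for word in set(ham_df) | set(spam_df):
        p_spam = (spam_df[word] + 1) / (len(spam_msgs) + 2)
        p_ham = (ham_df[word] + 1) / (len(ham_msgs) + 2)
        scores[word] = math.log(p_spam / p_ham)
    # Large positive scores point to spam, large negative scores point to ham.
    ranked = sorted(scores, key=lambda w: (-abs(scores[w]), w))
    return [(w, scores[w]) for w in ranked[:top_k]]
```

The words with large negative scores are the ham indicators, which is the side that matters when short or machine-generated ham messages contain none of the current feature words.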
Kei Shun Ma  163
06-08-2007 10:10 PM ET (US)
Will we have any final review and lecture notes posted online?