| |
Messages 341-340 deleted by topic administrator 07-14-2009 01:38 PM |
|
Charles Elkan
12-17-2008
06:54 PM ET (US)
|
Linguistics/CSE 256, Statistical Natural Language Processing
The goal of this course is to train students to do research in natural language processing, work that can potentially be published in the leading conferences and journals of the field. Course coverage will be both theoretical (models, algorithms, cognitive implications) and practical (how to get the models and algorithms to work well!). In addition to helping you succeed academically in this field (and related fields including AI, machine learning, and psycholinguistics), this is also great training if you are interested in doing NLP work in industry, either in a research lab (Google, Microsoft, Powerset, Yahoo, etc.) or in a startup.
The course welcomes graduate students in linguistics, computer science, engineering, cognitive science, psychology, and any other discipline who are interested in how to process natural language by computer. Interested postdocs and faculty are welcome to sit in. Highly motivated undergraduates are also encouraged to join the class; just contact me beforehand & let me know your background.
Course topics include:
o Word Segmentation
o Language Modeling
o Text Categorization
o Word-sense Disambiguation
o Part-of-speech tagging
o Machine translation
o Parsing
o Computational Semantics
o Discourse Processing
o Unsupervised Language Learning
Last year's complete syllabus is available here:
http://idiom.ucsd.edu/~rlevy/lign256/winte...gn256_syllabus.html
This time around, I plan on putting a stronger emphasis on unsupervised learning (especially using, though not limited to, nonparametric Bayesian methods), which is taking a position of increasing prominence within the field.
Feel free to contact me with any questions you may have, and I hope to see you in class in January!
Best
Roger Levy
--
Roger Levy                   Email: rlevy@ling.ucsd.edu
Assistant Professor          Phone: 858-534-7219
Department of Linguistics    Fax: 858-534-4789
UC San Diego                 Web: http://ling.ucsd.edu/~rlevy
|
|
Charles Elkan
12-17-2008
06:52 PM ET (US)
|
/m337: The mean on the final exam was 90, with standard deviation 23.
Happy holidays to everyone! Charles
|
|
student
12-15-2008
02:11 AM ET (US)
|
When you have graded the finals, can you post the class mean/stdev here? Thank you.
|
|
Charles Elkan
12-11-2008
10:08 AM ET (US)
|
/m330: Many of us used several tutorials and papers as part of the projects. Can we use those during the exam? These papers might be very useful to us, and this obviously doesn't break the level playing field as everyone has printer access in the APE labs.
Sorry, I'm going to say no to this. I don't expect it would be very helpful, and I don't want to set off a scramble to copy papers and bring a large stack. You can bring your own handwritten notes made from any source.
If, during the exam, you think a particular formula or fact from a published source would be useful, you can always ask me.
|
|
Charles Elkan
12-11-2008
01:27 AM ET (US)
|
/m333, /m334: It's only a broad meaning because in a narrow meaning of the word "clustering," an item is assumed to be produced by a single true cluster.
In LDA, each word can belong to multiple topics, so words aren't clustered in the narrow sense. Each document is produced by combining multiple topics, not by selecting any single topic or cluster.
Each theta_m is a probability vector, and we can interpret it as the prior probabilities of a document belonging to each topic.
I don't think so. theta_m gives the prior probability that a word in document m belongs to each topic. This is not the same thing as the whole document belonging to the topic.
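To make the distinction concrete, here is a minimal sketch of the LDA generative process for a single document, in plain Python. The names `sample_dirichlet`, `generate_document`, `K`, `V`, and the hyperparameter values are illustrative assumptions, not taken from the lecture notes. The point is only that a topic z is drawn afresh from theta_m for every word, so theta_m acts as a prior over per-word topic assignments rather than over one document-level cluster.

```python
import random

def sample_dirichlet(alpha, rng):
    """Draw a probability vector from Dirichlet(alpha) via normalized gamma draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [d / total for d in draws]

def generate_document(theta_m, phi, n_words, rng):
    """Generate one document: every word position gets its own topic z ~ theta_m."""
    topic_ids = list(range(len(phi)))
    words, assignments = [], []
    for _ in range(n_words):
        z = rng.choices(topic_ids, weights=theta_m)[0]        # topic for this word only
        vocab_ids = list(range(len(phi[z])))
        w = rng.choices(vocab_ids, weights=phi[z])[0]         # word drawn from topic z
        assignments.append(z)
        words.append(w)
    return words, assignments

rng = random.Random(0)                 # fixed seed so the run is repeatable
K, V = 3, 8                            # assumed: 3 topics, vocabulary of 8 word ids
phi = [sample_dirichlet([0.5] * V, rng) for _ in range(K)]   # topic-word distributions
theta_m = sample_dirichlet([1.0] * K, rng)                   # topic proportions, document m
words, assignments = generate_document(theta_m, phi, 50, rng)
print("theta_m:", [round(t, 3) for t in theta_m])
print("distinct topics used in this document:", sorted(set(assignments)))
```

In a typical run the document uses more than one topic, which is why it is not a clustering of documents in the narrow, single-true-cluster sense.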
|
|
Aditya Menon
12-10-2008
08:49 PM ET (US)
|
/m333: Can you clarify why it's only for a broad meaning? My reasoning is that LDA learns a set of theta vectors, one for each document. Each theta_m is a probability vector, and we can interpret it as the prior probabilities of a document belonging to each topic. Hence, theta_m specifies a soft clustering of document m. Is this too simplistic?
|
|
Charles Elkan
12-10-2008
08:35 PM ET (US)
|
Can LDA be categorized as a soft clustering technique?
Sure, but only for a broad meaning of the word "clustering."
|
|
Aditya Menon
12-10-2008
08:28 PM ET (US)
|
/m329: Can LDA be categorized as a soft clustering technique?
|
|
Charles Elkan
12-10-2008
08:09 PM ET (US)
|
/m327: ... CRF are generative models since we could use a sampling method to assess the model parameters in both cases.
Sorry, it's not clear what you mean by "assess" here. In any case, I don't know any sampling method used in the context of CRFs, with the small exception of Gibbs sampling as a subroutine of contrastive divergence training.
On the other hand with discriminative models ... the model parameters don't define a distribution space.
They often do. With logistic or linear regression the parameters define conditional distributions for y.
A simple example is the perceptron, where the w separating hyperplane is not in any way informative about distribution of the sample space.
w is not informative about x but it is about y given x. w is trained to provide this information, and not to provide information about x. For this reason, the perceptron can be called discriminative, even though it is not probabilistic.
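As a rough illustration of the perceptron point, the sketch below trains a perceptron on a tiny made-up dataset. The data and the names `train_perceptron` and `predict` are assumptions for illustration only. Every update to w is driven by a mistake on the label y for a given x; nothing in the procedure models how the x values themselves are distributed.

```python
def train_perceptron(xs, ys, epochs=20):
    """Classic perceptron: update w and b only when a label is predicted wrongly."""
    w = [0.0] * len(xs[0])
    b = 0.0
    for _ in range(epochs):
        mistakes = 0
        for x, y in zip(xs, ys):                  # labels y are +1 or -1
            activation = sum(wi * xi for wi, xi in zip(w, x)) + b
            if y * activation <= 0:               # wrong side of (or on) the hyperplane
                w = [wi + y * xi for wi, xi in zip(w, x)]
                b += y
                mistakes += 1
        if mistakes == 0:                         # a full pass with no errors: stop
            break
    return w, b

def predict(w, b, x):
    """Predict y given x; w says nothing about how likely x itself is."""
    return 1 if sum(wi * xi for wi, xi in zip(w, x)) + b > 0 else -1

# Made-up, linearly separable training data.
xs = [(2.0, 1.0), (3.0, 2.5), (-1.0, -2.0), (-2.5, -1.0)]
ys = [1, 1, -1, -1]
w, b = train_perceptron(xs, ys)
predictions = [predict(w, b, x) for x in xs]
print("w =", w, "b =", b, "predictions =", predictions)
```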
K-means/hierarchical clustering are also discriminative and are very good examples of unsupervised learning.
Certainly they are good examples of unsupervised learning, but the adjective "discriminative" doesn't apply to them in any useful meaning that I know.
Note that the word "classification" itself is ambiguous. In ordinary English, it can mean the same thing as clustering, i.e. discovering groups. But in ML it typically means only placing into predefined classes. Other nouns whose meaning can be ambiguous include discovery, recognition, prediction, search, and identification.
|
|
Matan
12-10-2008
08:04 PM ET (US)
|
Many of us used several tutorials and papers as part of the projects. Can we use those during the exam? These papers might be very useful to us, and this obviously doesn't break the level playing field as everyone has printer access in the APE labs.
|
|
Charles Elkan
12-10-2008
07:58 PM ET (US)
|
/m325: Below is how I would edit what you wrote. First, let me say that terminology is just terminology. It is ok for different people to use the same terms with broader or narrower meanings, and it can happen that terminological distinctions break down when you push too hard to make them precise.
Second, LDA and other unsupervised methods are not classifiers. Classifiers are examples of supervised learning. Supervised learning means learning to predict a y as a function of an x. So:
The LDA model is generative. It assumes some generative model for how the observed and unobserved data was produced. The unobserved data are (theta, phi) and labels (Z).
On the other hand, a discriminative classifier takes the observed data as given and assumes a model for how labels are generated. x and y are both observed in training data; only x is observed in test data. A discriminative classifier estimates p(y|x). Training a discriminative classifier chooses model parameters that maximize the likelihood of training labels given a training set of observations.
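A small sketch may help with "maximize the likelihood of training labels given a training set of observations." It fits logistic regression by plain gradient ascent on the conditional log-likelihood, the sum over training pairs of log p(y|x). The dataset, learning rate, step count, and function names are assumptions chosen for illustration; this is one simple way to do the optimization, not the only one.

```python
import math

def sigmoid(t):
    return 1.0 / (1.0 + math.exp(-t))

def conditional_log_likelihood(w, b, xs, ys):
    """Sum over training pairs of log p(y | x); the x values are taken as given."""
    total = 0.0
    for x, y in zip(xs, ys):
        p = sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)   # p(y=1 | x)
        total += math.log(p if y == 1 else 1.0 - p)
    return total

def train_logistic(xs, ys, lr=0.1, steps=200):
    """Gradient ascent on the conditional log-likelihood (labels y are 0 or 1)."""
    w = [0.0] * len(xs[0])
    b = 0.0
    for _ in range(steps):
        grad_w = [0.0] * len(w)
        grad_b = 0.0
        for x, y in zip(xs, ys):
            p = sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)
            err = y - p                            # gradient of log p(y | x) w.r.t. the score
            grad_w = [g + err * xi for g, xi in zip(grad_w, x)]
            grad_b += err
        w = [wi + lr * g for wi, g in zip(w, grad_w)]
        b += lr * grad_b
    return w, b

# Made-up training data: x and y are both observed.
xs = [(0.5, 1.0), (1.5, 2.0), (3.0, 0.5), (-1.0, -0.5), (-2.0, 1.0), (0.0, -2.0)]
ys = [1, 1, 1, 0, 0, 0]

ll_before = conditional_log_likelihood([0.0, 0.0], 0.0, xs, ys)
w, b = train_logistic(xs, ys)
ll_after = conditional_log_likelihood(w, b, xs, ys)
print("log-likelihood before:", round(ll_before, 3), "after:", round(ll_after, 3))
```

At test time only x is available, and the fitted w and b are used to compute p(y|x) for it.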
... discriminative classification is separate from unsupervised learning. ... perceptron, linear regression, KNN, logistic regression, and conditional random fields are discriminative models (is CRF a "classifier"? YES. It outputs a sequence of tags, but you could say it "classifies" each item it tags), while LDA is generative. It is notable that the first group of models are examples of supervised learning, while LDA is unsupervised.
|
|
Charles Elkan
12-10-2008
07:42 PM ET (US)
|
/m324, /m326: The exam will be in the usual classroom.
I would like to say yes to using Bishop's book, and other books, but I'm going to say no books allowed. The reason is that other students won't have access to the same books, and we need a level playing field. So, "open book" means my lecture notes and any notes in your own handwriting, from lectures for 250B and/or from your individual study. Copies of your own project reports are allowed also, plus a calculator.
I can't think of anything else that might be useful. If you can, please ask here.
|
|
Matan
12-10-2008
05:29 PM ET (US)
|
The way I understand it, a generative model is a model where you can sample from the posterior distribution given the model parameters. Any model for which you can use a sampling approach (for example Gibbs) is a generative model. For example, LDA and also CRF are generative models since we could use a sampling method to assess the model parameters in both cases.
On the other hand, with a discriminative model you can't use sampling, as the model parameters don't define a distribution space. A simple example is the perceptron, where the w separating hyperplane is not in any way informative about distribution of the sample space.
As to discriminative models for unsupervised learning, I'm not sure, but I think the most commonly used 'simple' clustering approaches - K-means/hierarchical clustering - are also discriminative and are very good examples of unsupervised learning.
|
|
Student
12-10-2008
04:51 PM ET (US)
|
Can we use the Bishop book on the exam since it is on the syllabus?
|
|
|