Jaak Vilo
09282009
06:13 AM ET (US)

Sigrid, try out different kernels. See how they work and, if possible, try to find examples where one is better than the others, or good enough...

Sigrid Pedoson
09262009
11:14 AM ET (US)

I don't understand task 4.

Sigrid Pedoson
09262009
10:58 AM ET (US)

Question about bonus.
How many points will be given 1) for the first bonus, if tasks 2 and 3 are done in R, and 2) for the second bonus, if a scatterplot is drawn?

Sven Laur
09162009
04:21 AM ET (US)

The question Sulev raised is a matter of taste. The resulting list of closed itemsets is the same whether we consider all proper supersets that are frequent or all proper supersets without any restrictions.
Indeed, consider a frequent itemset that is not closed under the strict definition, i.e., there is some proper superset that has the same support. Then this superset must also be frequent, and thus the set in question is not closed under the weaker definition either. The implication in the other direction is trivial.
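The equivalence can also be checked by brute force on a small example. The following Python sketch is only an illustration: the transactions, the threshold of 5 and the function names are made up for this purpose and do not come from the course materials. It computes the closed frequent itemsets once with all proper supersets and once with frequent proper supersets only.

```python
from itertools import combinations

def all_supports(transactions):
    """Support count of every non-empty itemset over the items that occur."""
    items = sorted(set().union(*transactions))
    supports = {}
    for size in range(1, len(items) + 1):
        for combo in combinations(items, size):
            candidate = frozenset(combo)
            supports[candidate] = sum(candidate <= t for t in transactions)
    return supports

def closed_frequent(supports, min_support, frequent_supersets_only):
    """Frequent itemsets whose proper supersets all have smaller support."""
    frequent = {s for s, count in supports.items() if count >= min_support}
    # Weaker definition: compare only against frequent proper supersets.
    # Strict definition: compare against every proper superset.
    pool = frequent if frequent_supersets_only else set(supports)
    return {
        s for s in frequent
        if all(supports[t] < supports[s] for t in pool if s < t)
    }

# Hypothetical data: five transactions {a,b} and five transactions {a,c}.
transactions = [frozenset("ab")] * 5 + [frozenset("ac")] * 5
supports = all_supports(transactions)

strict = closed_frequent(supports, 5, frequent_supersets_only=False)
weak = closed_frequent(supports, 5, frequent_supersets_only=True)
print(strict == weak)
print(sorted("".join(sorted(s)) for s in strict))
```

On this data both variants return {a}, {a,b} and {a,c}; {b} and {c} are frequent but not closed, because {a,b} and {a,c} have the same support as they do.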

Karl Potisepp
09162009
12:56 AM ET (US)

The definition of closed frequent itemsets says that if all its proper supersets have smaller support than the itemset in question, then it is a closed set. If you can establish that the itemset in question is not contained in any larger itemset that has equal support, then it is a closed itemset. If the itemset in question has no proper supersets at all (with smaller support or not), then the number of "all its proper supersets" is 0, the criterion applies, and you have found a closed itemset.
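The last case is an instance of a condition over an empty collection being true. A minimal Python sketch, assuming a made-up support table and a hypothetical helper is_closed, shows the same behaviour: all() over an empty sequence returns True.

```python
def is_closed(itemset, supports):
    """True if every proper superset in the table has smaller support."""
    proper_supersets = [t for t in supports if itemset < t]
    # With no proper supersets the condition holds vacuously.
    return all(supports[t] < supports[itemset] for t in proper_supersets)

# Hypothetical support counts over the items a and b only.
supports = {
    frozenset("a"): 4,
    frozenset("b"): 3,
    frozenset("ab"): 3,
}

print(is_closed(frozenset("ab"), supports))  # no proper supersets at all
print(is_closed(frozenset("a"), supports))   # {a,b} has smaller support
print(is_closed(frozenset("b"), supports))   # {a,b} has equal support
```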

Sulev Reisberg
09152009
04:57 PM ET (US)

I have a question about closed itemsets. On slide 26 the definition is "A frequent itemset is closed if all its proper supersets have smaller support". Am I right that "proper superset" stands for a superset which has one item more AND MUST ALSO BE A FREQUENT ITEMSET?
I also add an example just to be sure that I am getting it right: Let {a} be a closed frequent itemset with support 10. It has frequent supersets {a,b} with support 5 and {a,c} with support 5. Assuming that there are no more frequent itemsets, are {a,b} and {a,c} also closed frequent itemsets?

Sven Laur
09152009
04:30 AM ET (US)

A superset of A is a set that contains all elements of A and possibly something else. In Estonian it is 'ülemhulk', the antonym of 'alamhulk' (subset).

Sigrid Pedoson
09142009
02:58 PM ET (US)

What is "superset"? What is it in Estonian?

Sven Laur
09142009
08:35 AM ET (US)

> I've a question about the definition of maximal frequent itemsets. On the lecture slides, a frequent itemset is said to be maximal if all its *proper* supersets are infrequent (emphasis mine). Does the word 'proper' place some additional requirements on the supersets of the maximal itemset aside from the fact that the maximal itemset must be contained in them?
Note that A is a superset of A. Thus if A is frequent then not all of its supersets are infrequent, namely A itself is frequent. A proper superset of A is a superset that is strictly larger than A and this makes the existence of maximal frequent itemsets possible.
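A small Python sketch can make the role of 'proper' concrete. The support table, the threshold and the function name below are hypothetical, and the table is assumed to list every frequent itemset. With proper supersets the maximal sets are found; if A counts as its own superset, no frequent set can ever qualify.

```python
def maximal_frequent(supports, min_support, proper=True):
    """Frequent itemsets that have no frequent (proper) superset."""
    frequent = {s for s, count in supports.items() if count >= min_support}
    result = set()
    for s in frequent:
        if proper:
            # s < t: t contains s and is strictly larger.
            has_frequent_superset = any(s < t for t in frequent)
        else:
            # s <= t also holds for t == s, so this is always True.
            has_frequent_superset = any(s <= t for t in frequent)
        if not has_frequent_superset:
            result.add(s)
    return result

# Hypothetical support counts; the minimum support is 5.
supports = {
    frozenset("a"): 10,
    frozenset("b"): 6,
    frozenset("c"): 5,
    frozenset("ab"): 5,
    frozenset("ac"): 5,
    frozenset("bc"): 1,
    frozenset("abc"): 1,
}

print(maximal_frequent(supports, 5))                # {a,b} and {a,c}
print(maximal_frequent(supports, 5, proper=False))  # empty set
```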

Sven Laur
09142009
08:32 AM ET (US)

The questions raised about the lift were correct, and by now the formula is in the correct form. To make sure, I will just repeat the derivation.
First observation: Pr[X] = supp(X)/n, Pr[Y] = supp(Y)/n, Pr[X and Y] = supp(X u Y)/n.
Second observation: lift shows how much larger Pr[X and Y] is than Pr[X] Pr[Y], i.e., how much Pr[X and Y] is overrepresented compared to the model where X and Y are drawn independently.
Definition:
lift(X=>Y) = Pr[X and Y] / (Pr[X] Pr[Y]) = (supp(X u Y)/n) / (supp(X)/n * supp(Y)/n) = n * supp(X u Y) / (supp(X) * supp(Y))
As a side remark, note that lift has an asymmetric range, i.e., lift(X=>Y) \in [0, infinity), whereas the logarithm of lift has a symmetric range. That is, lift 0.5 means that Pr[X and Y] is two times underrepresented and lift 2 means that Pr[X and Y] is two times overrepresented.
As a second side remark, note that lift(X=>Y) = lift(Y=>X), i.e., lift is actually a property of the pair X, Y and thus does not have a direction. The latter is one drawback of the lift measure.
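The derivation can be checked numerically. The Python sketch below uses made-up support counts and hypothetical function names. It computes lift once from absolute supports with the factor n and once from the probabilities, and confirms that the two agree and that swapping X and Y does not change the value.

```python
from fractions import Fraction

def lift(supp_x, supp_y, supp_xy, n):
    """Lift from absolute supports: n * supp(X u Y) / (supp(X) * supp(Y))."""
    return Fraction(n * supp_xy, supp_x * supp_y)

def lift_from_probabilities(supp_x, supp_y, supp_xy, n):
    """Lift as Pr[X and Y] / (Pr[X] * Pr[Y])."""
    p_x = Fraction(supp_x, n)
    p_y = Fraction(supp_y, n)
    p_xy = Fraction(supp_xy, n)
    return p_xy / (p_x * p_y)

# Hypothetical counts: 1000 transactions, X in 200, Y in 100, both in 40.
n = 1000
supp_x, supp_y, supp_xy = 200, 100, 40

value = lift(supp_x, supp_y, supp_xy, n)
print(value)  # 2, i.e. X and Y co-occur twice as often as under independence
print(value == lift_from_probabilities(supp_x, supp_y, supp_xy, n))
print(value == lift(supp_y, supp_x, supp_xy, n))  # no direction
```

Exact fractions are used instead of floats so that the equality checks are not affected by rounding.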

Sulev Reisberg
09142009
05:49 AM ET (US)

The lift formula is now inverted, thus my previous post is not relevant anymore (I could not delete it). But I share the concern raised by Rudolf Elbrecht about "1/n": in my opinion, too, it should be "n".

Rudolf Elbrecht
09142009
01:22 AM ET (US)

I have a question about the corrected lift formula on slide 16. Shouldn't the normalisation coefficient for lift, when using absolute values, be "n", not "1/n"? The slides give the formula lift(X=>Y) = 1/n * supp(X U Y)/(supp(X)*supp(Y)), but in my opinion it is not the same as lift(X=>Y) = [supp(X U Y)/n]/([supp(X)/n]*[supp(Y)/n]). Am I just making some stupid mistake in the calculation, or do others also think that instead of "1/n" there should be just "n" in the first formula?
Edited 09142009 01:23 AM

Sulev Reisberg
09132009
11:24 AM ET (US)

I have a question about calculating the lift value. On slide 16 of the lecture by Sven Laur the following formula is given for calculating a lift value:
lift(X=>Y) = 1/n * supp(X)supp(Y)/supp(X U Y)
We can rewrite this as follows:
lift(X=>Y) = [supp(Y)/n] * supp(X)/supp(X U Y) = [supp(Y)/n] / [supp(X U Y)/supp(X)] = [supp(Y)/n] / conf(X=>Y)
The first component here shows the probability of getting Y. The second component shows the probability of getting Y WHEN X (conditional probability). Now comes the difficult part: as I understand it, the aim of calculating the lift value is to evaluate the rules, hence I cannot imagine why we are dividing in this order: the first component (which has nothing to do with the rule) by the second one? Shouldn't we use supp(Y)/n as the basis? In my opinion it would be much more logical to do it vice versa.
Let me explain this with an example. Assume that we have a dataset with 1000 transactions. In 100 of them Y is present, thus the overall probability of getting Y is 10%. Let's also assume that we have 2 rules: {A}=>{Y} with confidence 50% and {B}=>{Y} with confidence 20%. If we need to know which of these rules deserves more attention, then it would be obvious to divide the confidence by the overall probability of getting Y (10%) as the basis:
{A}=>{Y}: 50%/10% = 5
{B}=>{Y}: 20%/10% = 2
This would also make the word "lift" a bit more reasonable: you can make a rule more important (lift it up) by dividing it (its confidence) by the overall probability of getting the result.
The course page also references the material http://www.borgelt.net/slides/fpm.pdf, which seems to calculate lift the other way round. Therefore my question is: am I getting it all wrong, or does it matter at all which way I calculate it?
Edited 09132009 11:28 AM
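The arithmetic of the example in this post can be reproduced with a short Python sketch. It follows the direction argued for here, confidence divided by the overall probability of Y; the function name is hypothetical and the numbers are the ones from the example.

```python
from fractions import Fraction

def lift_from_confidence(confidence, supp_y, n):
    """Confidence of X=>Y divided by the overall probability of Y."""
    return confidence / Fraction(supp_y, n)

n = 1000      # transactions in the dataset
supp_y = 100  # transactions containing Y, so Pr[Y] = 10%

lift_a = lift_from_confidence(Fraction(50, 100), supp_y, n)  # {A}=>{Y}
lift_b = lift_from_confidence(Fraction(20, 100), supp_y, n)  # {B}=>{Y}
print(lift_a, lift_b)  # 5 2
```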

Karl Potisepp
09112009
04:27 AM ET (US)

I've a question about the definition of maximal frequent itemsets. On the lecture slides, a frequent itemset is said to be maximal if all its *proper* supersets are infrequent (emphasis mine). Does the word 'proper' place some additional requirements on the supersets of the maximal itemset aside from the fact that the maximal itemset must be contained in them?

Jaak Vilo
09112009
02:25 AM ET (US)

The question was how to get the meanings of the numbers in the bonus task...
Well, this is one of the problems with published data sets: some have been stripped of their meanings.
I did not read the background data. It is possible that the original publishers of the data have the mapping between the numbers and the attribute names somewhere. If someone finds it, please let others know.
From the point of view of testing ideas, it can often be the case that you work on "meaningless" data and then go to the customers, who can decide whether what was found is interesting. So interestingness should come from the data first, not from the meaning.
Jaak
PS. Let's use the mailing list and discussion board for questions of general interest.
Karl Blum wrote:
> Hello,
>
> A question about the bonus task. Do the databases we use there really not give the meanings of the items? The description of the accidents database does list the different items, but not how they relate to the numbers. Drawing conclusions this way is rather boring.
>
> Karl

KT
09092009
04:50 AM ET (US)

Testing..
