| Nikola Jevtic
|
1
|
 |
|
05-11-2002 05:43 AM ET (US)
|
|
I will try to be short but precise. The results mentioned in the paper are not at all surprising. If a problem is defined as compressing an iid source of unknown distribution, and as a performance measure is used the worst redundancy (extra # of bits on top of the the entropy of the source) when the adversary gets to choose both x^n and the source parameters, then: provably the smallest achievable redundancy is: (k-1)/2*log(n/2pi)+C(k)+o(1). As far as "add 1/2" rule is concerned, if all the symbol counts (ni) grow linearly with n, redundancy of add 1/2 is asymptotically optimal (is equal to the above). However at the vertex points (sun always rises), it is slightly worse than the optimal coding scheme, and the redundancy is: (k-1)/2*log(n/pi)+C(k)+o(1). What Ristad is saying is that it is (close to) optimal when the number of symbol is k, but if it were smaller maybe we are not comparing it to the right "best possible performance". It is absolutely correct, but perhaps trivial. by applying his scheme to add 1/2 rule we could : 1) send log(k) bits to say how many symbols we saw in x^n 2) send log(choose(k,q)) bits to describe the active alphabet 3) apply add 1/2 rule to the reduced alphabet, and this time have redundancy be bounded by : (q-1)/2*log(n/pi)+C(k)+o(1) (+log k + log choose(k,q))
If someone thinks it happens rarely that alphabet is smaller and that we pay too much extra (log k) if there is nothing to be saved we can just send 1 bit to say if savings is possible and if so, then send log k ...
Performance analysis: If we compare probabilities he assigns to sequences and those that we do by using modified add 1/2 rule it can be shown that if all symbols among the active q grow linearly with n, there is a constant factor between the two methods. However if any of the {ni} grows slower then linear in n, modified add 1/2 rule beats his scheme polynomially in n. (Note that it is impossible to give higher probability to all the sequences, as they all need to add up to 1.)
Experiments: It has not been (perhaps) said clear enough that experimental results may only be vaguely correlated with the theoretical analysis, as block probabilities and derived conditional probabilities may differ at will. (More precisely original block probabilities that he calls priors, and block probabilities for x^n computed as a sequence of conditional probabilites). Or at least no reason to believe differently was provided. This said it would be interesting to see how well would 'modified add 1/2 rule' perform because it could go either way.
Philosophical aspects: In the paper it was not explicitly said what we expect of the method. (E.g. to learn iid source) which then allows for bogus examples like broken random number generator. In reality, one can not hope to be able to learn 'any' source. It is only possible to restrict to a certain set of sources (e.g. iid, markov...) and try to be close to the best source in the closed set. If the set is unrestricted the worst case redundancy is always infinite.
|
| Bret Ehlert
|
2
|
 |
|
05-16-2002 05:23 AM ET (US)
|
|
Although I found the sunrise, Martian elephant, and random number generator examples amusing, I think the paper would have been a lot more interesting if the theoretical discussion was about more realistic problems. The results from experiments on the Calgary and natural language corpora are intriguing and I wish more explanation was given as to why the new approach works better than the traditional approaches for these problems.
|
| Eric Wiewiora
|
3
|
 |
|
05-16-2002 04:24 PM ET (US)
|
|
It seems that any multinomial estimation scheme assumes something about the input. The method in the paper is less biased towards the uniform distribution than Laplace's method. I assume that some bias towards uniform distribution is desired, or else you would simply assign
P(i) = (observations of i) / (total observations).
I think this method would behave similarly, if not "better", in many of his toy examples.
|
Dave Kauchak
|
4
|
 |
|
05-16-2002 05:22 PM ET (US)
|
|
I thought this paper was interesting and found some of the toy examples in the beginning particularly amusing. However, as others have mentioned, I wish the paper had given a bit more real world motivation. Even the real world tasks that were covered (byte and word prediction) seem somewhat useless (though I could be wrong). I would have preferred to see motivating examples that actually needed to be solved.
I also thought the paper could have done a much better job explaining notation. Although most of the predictive methods are fairly straightforward, I found myself confused sometimes because I couldn't figure out exactly what n or some other variable stood for.
|