QuickTopic (SM) free message boards QuickTopic (SM) free message boards
Skip to Messages
  Sign In to access your topic list  |New Topic |My Topics|Profile
Upgrade to Pro   Customize, show pictures, add an intro, and more:   QuickTopic Pro...and check out QuickThreadSM
Topic: A natural law of succession
Views: 203, Unique: 161 
Subscribers: 1
What's
this?
Printer-Friendly Page
Subscribe to get & post, or stop messages by email Subscribe
All messages            1-4 of 4        
About these ads
Who | When
Messagessort recent-top   
Post a new message
 
Nikola Jevtic  1
05-11-2002 05:43 AM ET (US)
I will try to be short but precise. The results
mentioned in the paper are not at all surprising.
If a problem is defined as compressing an iid source
of unknown distribution, and as a performance measure is used
the worst redundancy (extra # of bits on top of the
the entropy of the source) when the adversary gets to
choose both x^n and the source parameters, then:
provably the smallest achievable redundancy is:
(k-1)/2*log(n/2pi)+C(k)+o(1).
As far as "add 1/2" rule is concerned, if all the symbol
counts (ni) grow linearly with n, redundancy of add 1/2 is
asymptotically optimal (is equal to the above).
However at the vertex points (sun always rises), it is slightly
worse than the optimal coding scheme, and the redundancy is:
(k-1)/2*log(n/pi)+C(k)+o(1).
What Ristad is saying is that it is (close to) optimal
when the number of symbol is k, but if it were smaller maybe we
are not comparing it to the right "best possible performance".
It is absolutely correct, but perhaps trivial.
by applying his scheme to add 1/2 rule we could :
1) send log(k) bits to say how many symbols we saw in x^n
2) send log(choose(k,q)) bits to describe the active alphabet
3) apply add 1/2 rule to the reduced alphabet, and this time
have redundancy be bounded by :
(q-1)/2*log(n/pi)+C(k)+o(1) (+log k + log choose(k,q))

If someone thinks it happens rarely that alphabet is smaller
and that we pay too much extra (log k) if there is nothing to
be saved we can just send 1 bit to say if savings is possible
and if so, then send log k ...

Performance analysis:
If we compare probabilities he assigns to sequences and those that we
do by using modified add 1/2 rule it can be shown that if all
symbols among the active q grow linearly with n, there is a constant
factor between the two methods. However if any of the {ni} grows
slower then linear in n, modified add 1/2 rule beats his scheme polynomially
in n. (Note that it is impossible to give higher probability to all the
sequences, as they all need to add up to 1.)

Experiments:
It has not been (perhaps) said clear enough that experimental results
may only be vaguely correlated with the theoretical analysis, as
block probabilities and derived conditional probabilities may differ at will.
(More precisely original block probabilities that he calls priors, and
block probabilities for x^n computed as a sequence of conditional probabilites).
Or at least no reason to believe differently was provided.
This said it would be interesting to see how well would 'modified add 1/2 rule'
perform because it could go either way.

Philosophical aspects:
In the paper it was not explicitly said what we expect of the method.
(E.g. to learn iid source) which then allows for bogus examples
like broken random number generator.
In reality, one can not hope to be able to learn 'any' source.
It is only possible to restrict to a certain set of sources (e.g. iid, markov...)
and try to be close to the best source in the closed set.
If the set is unrestricted the worst case redundancy is always infinite.
Bret Ehlert  2
05-16-2002 05:23 AM ET (US)
Although I found the sunrise, Martian elephant, and random number generator examples amusing, I think the paper would have been a lot more interesting if the theoretical discussion was about more realistic problems. The results from experiments on the Calgary and natural language corpora are intriguing and I wish more explanation was given as to why the new approach works better than the traditional approaches for these problems.
Eric Wiewiora  3
05-16-2002 04:24 PM ET (US)
It seems that any multinomial estimation scheme assumes something about the input. The method in the paper is less biased towards the uniform distribution than Laplace's method. I assume that some bias towards uniform distribution is desired, or else you would simply assign

P(i) = (observations of i) / (total observations).

I think this method would behave similarly, if not "better", in many of his toy examples.
Dave KauchakPerson was signed in when posted  4
05-16-2002 05:22 PM ET (US)
I thought this paper was interesting and found some of the toy examples in the beginning particularly amusing. However, as others have mentioned, I wish the paper had given a bit more real world motivation. Even the real world tasks that were covered (byte and word prediction) seem somewhat useless (though I could be wrong). I would have preferred to see motivating examples that actually needed to be solved.

I also thought the paper could have done a much better job explaining notation. Although most of the predictive methods are fairly straightforward, I found myself confused sometimes because I couldn't figure out exactly what n or some other variable stood for.
RSS link What's this?
All messages            1-4 of 4        
QuickTopicSM message boards
Over 200,000 topics served
Learn more Frequently asked questions  Acknowledgements
What they're saying about QuickTopic
 Questions, comments, or suggestions? Contact Us
Read our use policy before beginning. We value your privacy; please read our privacy statement.
Copyright ©1999-2008 Internicity Inc. All rights reserved.