| Who | When |
Messages | |
| Dana Dahlstrom | 6 | 04-24-2002 12:35 AM ET (US) |
Joe:
``ergodic, Dirichlet prior and distribution, Tchebycheff's
inequality, Kullback-Leibler distance, etc.''
Ha! I felt the same way. (Except for Tchebycheff's inequality,
which I pattern-matched to the Chebyshev inequality I studied
in my undergraduate probability course.)
Obviously I didn't get around to thoroughly understanding
Dirichlet priors and distributions---Charles had to cover for me
there---but I did spend a lot of time reading about Markov chains
and ergodicity. If you're interested, here is the best treatment
I've found on the web so far:
http://random.mat.sbg.ac.at/~ste/diss/node6.html
I can't resist adding, on a cynical note, that perhaps all the
highfalutin terminology helped get the paper published. Price and
Boutilier's 1998 paper was based on the same basic ideas but was
much easier to understand; it didn't get accepted to ICML. :)
|
| Joe Drish | 5 | 04-23-2002 06:21 PM ET (US) |
This paper should have been more focused; that is, it should have concentrated the analysis on a few experiments, maybe just those shown in Figures 1 and 2. The paper also assumes that the reader knows many of the mathematical terms it uses, like ergodic, Dirichlet prior and distribution, Tchebycheff's inequality, Kullback-Leibler distance, etc.
|
| Dana Dahlstrom | 4 | 04-23-2002 04:17 PM ET (US) |
Dave:
I wholeheartedly agree the exposition in this paper is far from
an ideal elucidation of their ideas. Perhaps it's because I
lacked some of the necessary background, but I really struggled
to understand it in detail.
I don't think they've assumed the mentor and the observer have
the same goal(s), however. As you point out, they stress that the
mentor and the observer don't need to have the same reward
function for their technique to work. The example illustrated in
figures 9 and 10 is their attempt to demonstrate this.
Greg:
This is sort of a fine point, but in figures 3 and 4 I don't
think the mentor is giving misleading information; rather, the
learner's prior beliefs about the mentor's transition
probabilities are doing the damage. I believe the mentor itself
is actually following an optimal policy.
This kind of misunderstanding is easy to make, especially given
that their explanation of how the priors are computed is (for me)
less than adequate.
|
| Greg Hamerly | 3 | 04-23-2002 02:43 PM ET (US) |
One thing I liked about this paper is the experiment they performed in figures 3 & 4, which has the mentor giving misleading information. Clearly, a training signal that is better than random (from a mentor that has a correct policy) will improve performance -- but what about evil mentors, or error-prone mentors? This experiment speaks to those.
Obviously the constraint that the mentor's state space be a subset of the observer's is a strict one; can anyone comment on how well this restriction can be overcome?
|
| Dave Kauchak | 2 | 04-23-2002 04:16 AM ET (US) |
I thought the ideas of the paper were good and was fairly impressed with the results; however, I found that the writing got in the way of understanding the ideas clearly.
The paper tended to use an interesting vocabulary, such as "extract information" and numerous references to "behavior" with no obvious definition. Most of the time the meaning could be inferred, but it left an imprecise feeling.
The introduction and particularly the abstract were somewhat uninformative about the paper.
I think the paper could have used some examples to help clarify some of the ideas. For example, the paper states that there are many situations where R0 is known, but not the transition probabilities. A list of a few such cases would make the descriptions more understandable.
Finally, one key assumption that is implied but, I think, never stated is that the mentor and the observer have the same goal(s). I found this a bit confusing initially, particularly given that the mentor and the observer don't have to have the same reward function.
|
| Eric Wiewiora | 1 | 04-22-2002 07:21 PM ET (US) |
One of the things I note about the paper is that they do not state how long the observer gets to observe the mentor. Because their algorithm is so dependent on the statistical reliability of the mentor's value function, it would make sense to be more explicit on this issue.
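To give a rough sense of why observation length matters, here is a minimal sketch. It is not the paper's method. It assumes (my assumption) that the observer estimates a single mentor transition probability as an empirical frequency over n independent observations, and it applies the standard Chebyshev (Tchebycheff) inequality to that estimate. The function names are hypothetical.

```python
import math
from fractions import Fraction


def chebyshev_bound(n, eps):
    """Chebyshev upper bound on P(|p_hat - p| >= eps) for an empirical
    frequency p_hat over n independent observations.

    Var(p_hat) = p(1 - p) / n <= 1 / (4n), so the bound is 1 / (4 n eps^2),
    capped at 1 because it is a probability.
    """
    return min(1.0, 1.0 / (4 * n * eps ** 2))


def chebyshev_samples_needed(eps, delta):
    """Smallest n for which the bound above is at most delta.

    Solving 1 / (4 n eps^2) <= delta gives n >= 1 / (4 eps^2 delta).
    Fractions keep the arithmetic exact when eps and delta are given
    as strings such as "0.1".
    """
    eps, delta = Fraction(eps), Fraction(delta)
    if eps <= 0 or not (0 < delta < 1):
        raise ValueError("need eps > 0 and 0 < delta < 1")
    return math.ceil(1 / (4 * eps ** 2 * delta))


if __name__ == "__main__":
    # Observations needed for accuracy 0.1 with failure probability 0.05.
    print(chebyshev_samples_needed("0.1", "0.05"))
```

Chebyshev is a loose bound, so the count it gives is conservative, but it shows how quickly the required observation time grows as the tolerance shrinks: halving eps quadruples n.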
|