QuickTopic (SM) free message boards
Topic: Implicit Imitation in Multi-agent Reinforcement Learning
Dana Dahlstrom  6
04-24-2002 12:35 AM ET (US)
Joe:

``ergodic, Dirichlet prior and distribution, Tchebycheff's
inequality, Kullback-Leibler distance, etc.''

Ha! I felt the same way. (Except for Tchebycheff's inequality,
which I pattern-matched to the Chebyshev's inequality I studied
in my undergraduate probability course.)

Obviously I didn't get around to thoroughly understanding
Dirichlet priors and distributions---Charles had to cover for me
there---but I did spend a lot of time reading about Markov chains
and ergodicity. If you're interested, here is the best treatment
I've found on the web so far:

http://random.mat.sbg.ac.at/~ste/diss/node6.html

I can't resist adding, on a cynical note, that perhaps all the
highfalutin terminology helped get the paper published. Price and
Boutilier's 1998 paper was based on the same basic ideas but was
much easier to understand; it didn't get accepted to ICML. :)
Joe Drish  5
04-23-2002 06:21 PM ET (US)
This paper should have been more focused: it should have concentrated its analysis on a few experiments, maybe just those shown in Figures 1 and 2. It also assumes that the reader already knows many of the mathematical terms it uses, like ergodic, Dirichlet prior and distribution, Tchebycheff's inequality, Kullback-Leibler distance, etc.
Dana Dahlstrom  4
04-23-2002 04:17 PM ET (US)
Dave:

I wholeheartedly agree the exposition in this paper is far from
an ideal elucidation of their ideas. Perhaps it's because I
lacked some of the necessary background, but I really struggled
to understand it in detail.

I don't think they've assumed the mentor and the observer have
the same goal(s), however. As you point out, they stress that the
mentor and the observer don't need to have the same reward
function for their technique to work. The example illustrated in
figures 9 and 10 is their attempt to demonstrate this.

Greg:

This is sort of a fine point, but in figures 3 and 4 I don't
think the mentor is giving misleading information; rather, the
learner's prior beliefs about the mentor's transition
probabilities are doing the damage. I believe the mentor itself
is actually following an optimal policy.

This kind of misunderstanding is easy to make, especially since
their explanation of how the priors are computed is (for me)
less than adequate.
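
To make the point about priors concrete, here is a minimal sketch of how a prior can outweigh what the learner actually observes. It assumes a Dirichlet prior over next-state probabilities that is updated with observed transition counts; the function name and all the numbers are made up for illustration, and this is not claimed to be how the paper computes its priors.

```python
def posterior_mean(prior_counts, observed_counts):
    """Posterior mean of next-state probabilities under a Dirichlet prior.

    prior_counts: Dirichlet pseudo-counts, one per next state (hypothetical).
    observed_counts: number of observed transitions into each next state.
    """
    total = sum(prior_counts) + sum(observed_counts)
    return [(a + n) / total for a, n in zip(prior_counts, observed_counts)]


# Hypothetical data: the mentor was seen moving to state 1 in 4 of 5 steps.
observed = [1, 4]

# A strong prior that wrongly favors state 0 still dominates the estimate.
strong_prior = [9.0, 1.0]
# A weak prior with the same shape lets the observations win.
weak_prior = [0.9, 0.1]

print("strong prior:", posterior_mean(strong_prior, observed))
print("weak prior:  ", posterior_mean(weak_prior, observed))
```

With only five observations, the strong prior still puts most of the probability on state 0 even though the mentor mostly went to state 1, which is the kind of damage I mean.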
Greg Hamerly  3
04-23-2002 02:43 PM ET (US)
One thing I liked about this paper is the experiment they performed in figures 3 & 4, which has the mentor giving misleading information. Clearly, a training signal that is better than random (from a mentor that has a correct policy) will improve performance -- but what about evil mentors, or error-prone mentors? This experiment speaks to those.

Obviously the constraint that the mentor's state space be a subset of the observer's is a strict one; can anyone comment on how well this restriction can be overcome?
Dave Kauchak  2
04-23-2002 04:16 AM ET (US)
I thought the ideas of the paper were good and was fairly impressed with the results. However, I found the writing to be an obstacle to understanding the ideas clearly.

The paper tended to use an interesting vocabulary, such as "extract information" and numerous references to "behavior" with no obvious definition. Most of the time the meaning could be inferred, but it left an imprecise feeling.

The introduction and particularly the abstract were somewhat uninformative about the paper.

I think the paper could have used some examples to help clarify some of the ideas. For example, the paper states that there are many situations where R0 is known, but not the transition probabilities. A list of a few cases would help make the descriptions more understandable.

Finally, one key assumption that is implied but, I think, never stated is that the mentor and the observer have the same goal(s). I found this a bit confusing initially, particularly given that the mentor and the observer don't have to have the same reward function.
Eric Wiewiora  1
04-22-2002 07:21 PM ET (US)
One of the things I noted about the paper is that they do not state how long the observer gets to observe the mentor. Because their algorithm is so dependent on the statistical reliability of the mentor's value function, it would make sense to be more explicit on this issue.
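
As a rough illustration of why the observation count matters, here is a minimal sketch that uses Chebyshev's inequality to bound the error in an estimated transition probability. It assumes each observation is an independent yes/no sample of a single transition, so the variance of the estimate is at most 1/(4n); the function names, eps, and delta are my own placeholders, and this is not the analysis from the paper.

```python
import math


def chebyshev_bound(n, eps):
    """Upper bound on P(|estimate - true probability| >= eps) after n samples.

    Uses Var(estimate) <= 1 / (4 * n) for a Bernoulli frequency estimate.
    """
    return min(1.0, 1.0 / (4.0 * n * eps ** 2))


def samples_needed(eps, delta):
    """Smallest n for which the Chebyshev bound above is at most delta."""
    return math.ceil(1.0 / (4.0 * eps ** 2 * delta))


# Hypothetical tolerances: estimate within 0.1 of the truth, 95% of the time.
eps, delta = 0.1, 0.05
n = samples_needed(eps, delta)
print("observations needed:", n)
print("bound at n // 10:", chebyshev_bound(n // 10, eps))
print("bound at n:", chebyshev_bound(n, eps))
```

The bound is loose, but it shows that the required number of observations grows with 1/eps^2, so a short observation period leaves the estimates quite unreliable.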