| Who | When |
Messages | |
|
|
|
Dave Kauchak
|
1
|
 |
|
04-25-2002 03:37 AM ET (US)
|
|
Edited by author 04-25-2002 03:38 AM
I hate to be the one always complaining about the writing of these papers, but I found this paper particularly difficult to digest because of the writing. Some of this may arise because the paper seems to be at the intersection of reinforcement learning and game theory.
I think one of my main complaints is vocabulary/notation. The paper uses 'observations' instead of 'states'. Then, to make matters worse, the authors use s to abbreviate sources. This makes many of the equations look odd, given that most reinforcement learning texts use s as the abbreviation for state. On top of all this, the problems in the experimental section involve states of a seemingly different sort.
Also, in a number of places the paper seems to try and present things in chronological order, which tends to add to the confusion.
Beyond the writing, I also have a couple of questions. Is there a better or simpler method to solve some of these problems? For example, the paper notes that sources might knowingly adjust scores for personal gain. In this situation, it seems to me that if we simply impose the constraint that sources have no knowledge (or only minimal knowledge) of other sources, and we simply scale the rewards from these sources, then we can get the desired result.
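One way to read the scaling idea above, as a rough sketch only: rescale each source's rewards to a common range before combining them, so that a source cannot gain influence just by inflating its scores. The function names and the max-absolute-value normalization are illustrative assumptions, not anything taken from the paper, and this only addresses uniform inflation, not subtler manipulation.

```python
def normalize(rewards):
    """Scale one source's rewards so the largest magnitude is 1 (assumed scheme)."""
    peak = max(abs(r) for r in rewards)
    if peak == 0:
        return list(rewards)
    return [r / peak for r in rewards]


def combine(sources):
    """Sum the normalized rewards of all sources, action by action."""
    normalized = [normalize(s) for s in sources]
    return [sum(column) for column in zip(*normalized)]


# Hypothetical per-action rewards reported by two sources.
honest = [1.0, 2.0, 4.0]
other = [3.0, 1.0, 2.0]
# The first source multiplies all of its scores by 10 to try to dominate.
inflated = [10.0 * r for r in honest]

print(combine([honest, other]))
print(combine([inflated, other]))
```

Both calls print the same combined rewards, since the uniform inflation is removed by the normalization.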
Is there a reason that the paper chose this particular set of problems to experiment with? Although I have not done an extensive search of the reinforcement learning literature, I have never encountered these particular problems. Could we simply use the grid problem with multiple goals? It seems to me that using a more common test problem would be useful not only for familiarity, but also for credibility.
|
| Eugene Ke
|
2
|
 |
|
04-25-2002 11:26 AM ET (US)
|
|
This is off topic, but I'm curious whether anyone went to Dr. Hecht-Nielsen's seminar last Tuesday. I couldn't make it but would like to know what y'all thought.
|
| Eric Wiewiora
|
3
|
 |
|
04-25-2002 01:56 PM ET (US)
|
|
Response to Dave:
The notation in this paper is a little odd because it is more similar to the notation used by game theorists than ML researchers.
The two examples in the paper were chosen because they show off the two main benefits of the algorithm.
The first example shows that the policy of the agent adapts immediately to changes in reward sources.
The second example shows how the agent adopts a policy of having the sources compete over one state, not all of them. This behavior leads to a more consistent policy, and would be hard to capture with a simpler algorithm.
|
| Gyozo Gidofalvi
|
4
|
 |
|
04-25-2002 02:59 PM ET (US)
|
|
I found the problem of governing a public resource by a single agent a novel one. This problem has many useful and hard-to-solve real-world applications.
I also liked the approach taken to learn a voting scheme that cannot be manipulated by individual sources such that the output policy learned by the agent is optimal in terms of the preferences of any given source. By choosing votes that obey the Nash equilibrium, the algorithm ensures that it is in the best interest of any given source to train the agent according to the true preference of that source.
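For anyone who wants the Nash equilibrium condition spelled out, here is a minimal sketch of the generic definition for a two-player matrix game: a pair of actions is an equilibrium if neither player can do better by changing only their own action. The prisoner's dilemma payoffs and the helper name `is_nash` are illustrative assumptions; this is not the voting game from the paper.

```python
def is_nash(payoff_a, payoff_b, row, col):
    """True if neither player gains by deviating alone from (row, col)."""
    # Best the row player could get by switching rows, column held fixed.
    best_a = max(payoff_a[r][col] for r in range(len(payoff_a)))
    # Best the column player could get by switching columns, row held fixed.
    best_b = max(payoff_b[row][c] for c in range(len(payoff_b[row])))
    return payoff_a[row][col] >= best_a and payoff_b[row][col] >= best_b


# Prisoner's dilemma; action 0 = cooperate, action 1 = defect.
PAYOFF_A = [[-1, -3], [0, -2]]  # row player's payoffs
PAYOFF_B = [[-1, 0], [-3, -2]]  # column player's payoffs

print(is_nash(PAYOFF_A, PAYOFF_B, 1, 1))  # both defect
print(is_nash(PAYOFF_A, PAYOFF_B, 0, 0))  # both cooperate
```

Mutual defection passes the check and mutual cooperation does not, because either player can improve their own payoff by defecting alone.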
I felt that both of the problems and their difficulties could have been explained more clearly. A simple explanation, like the one given by Eric, of what the examples show would also have been very helpful.
|
| Yohan Kim
|
5
|
 |
|
04-25-2002 03:29 PM ET (US)
|
|
I would've liked more on the choice of the form of the equation estimating the return (equation 2): what guided this choice, and so on.
Question concerning equation 5: gradient descent optimization was used to arrive at a solution for alpha_s(x). I was wondering whether the author was able to guarantee that the value arrived at is the global minimum.
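To make the concern concrete, here is a toy sketch of why plain gradient descent gives no such guarantee in general. The function f below is a made-up non-convex example, not the paper's equation 5, and the step size and iteration count are arbitrary assumptions.

```python
def f(x):
    """A made-up non-convex objective with two local minima."""
    return x**4 - 3 * x**2 + x


def grad_f(x):
    """Derivative of f."""
    return 4 * x**3 - 6 * x + 1


def gradient_descent(x0, lr=0.01, steps=2000):
    """Plain fixed-step gradient descent starting from x0."""
    x = x0
    for _ in range(steps):
        x -= lr * grad_f(x)
    return x


left = gradient_descent(-2.0)
right = gradient_descent(2.0)
print(left, f(left))
print(right, f(right))
```

The two starting points settle in different minima, and the one reached from x0 = 2.0 has the higher objective value, so the answer depends on initialization unless the objective has more structure (for example, convexity).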
|
| Dana Dahlstrom
|
6
|
 |
|
04-25-2002 03:56 PM ET (US)
|
|
Dave: I think it makes sense to distinguish between observations
and states when the environment is only partially observable; the
observation is just the visible part of the state.
It's odd to me that none of the scenarios in the figures looks
like a Nash equilibrium, yet the authors write of both examples
that ``the algorithm consistently settled on the solution
shown''. For some reason, though, the policy selected for the
first example is only ``approximately uniform''. Why?
The second example seems to have even stranger irregularities.
Shouldn't the desired policies, votes, and resultant policy
reflect the symmetry in the state diagram? Why then, for example,
do both sources agree the agent should move left in states 2 and
3, but not that it should move right in states 7 and 8? Not only
do the authors not explain this, but they completely neglect to
mention it.
|
| Eric Wiewiora
|
7
|
 |
|
04-25-2002 05:29 PM ET (US)
|
|
I hope my presentation will clarify the issues raised about what the examples show and why the estimated return equation was chosen. By the way, I think the equation is written incorrectly.
Dana: Yes, there are plenty of inconsistencies and asymmetries in the preferred policies of the sources. I think this relates back to how the policies were learned.
|
| Eric Wiewiora
|
8
|
 |
|
04-25-2002 05:29 PM ET (US)
|
|
I hope my presentation will clarify the issues raised about what the examples show and why the estimated return equation was chosen. By the way, I think the equation is written incorrectly.
Dana: Yes, there are plenty of inconsistencies and asymmetries in the preferred policies of the sources. I think this relates back to how the policies were learned using linear least-squares error.
|
| Joe Drish
|
9
|
 |
|
04-25-2002 06:04 PM ET (US)
|
|
I agree that they should have related their work to baseline RL performance metrics and problems. I do, however, appreciate their attempt at solving [somewhat] real-world problems. The explanation of the Nash equilibrium problem could have been made simpler and more concrete. A plus about the paper is that it is not too ambitious.
|
| Degui Zhi
|
10
|
 |
|
04-25-2002 06:20 PM ET (US)
|
|
I appreciate much of the algorithmic construction the author did. I think there are a few types of people in the machine learning community, with theorists and practitioners at the two extremes, and the author of this paper sits in between.
|