| Who | When |
Messages | |
|
|
|
| Eric Wiewiora
|
15
|
 |
|
04-09-2002 03:25 PM ET (US)
|
|
I'm wondering how the second example's shaping is an example of a potential function. It looks like the potential for a state is time-dependent, so any subsequent state will have an even lower potential than the previous one, even if the agent is just moving back and forth. Am I mistaken in believing that moving in a cycle should always recieve a total F of zero?
|
| Dana Dahlstrom
|
14
|
 |
|
04-09-2002 03:18 PM ET (US)
|
|
Eugene:
I'm thinking of MDPs with finite state spaces, not infinite. As
for what I mean by optimal policy, the (second) definition in
section 2.1 paragraph 4 suits me fine.
I don't think Remark 2 assumes undiscounted MDPs: witness the
gamma in the equation. In any case, I think Remark 2 needs to be
true for discounted MDPs if their policy invariance claim is to
hold in general.
At risk of posting too voluminously, I feel like playing devil's
advocate: what good is shaping if it requires a good distance-
based heuristic? If we have one of those, can't we skip learning
and use it in a straightforward search algorithm, or determine an
optimal policy analytically? Worse, if our distance-based
heuristic is not good, won't shaping actually retard learning?
|
| Eugene Ke
|
13
|
 |
|
04-09-2002 02:51 PM ET (US)
|
|
Dana,
Don't the authors address your concern in section 2.1, last paragraph? They state for MDP's of infinite state spaces (discounted), you have to address the fact that their definition of Vm(S) may not exist. Even after that, you still have to consider what you mean by optimal polices.
I think Remark 2 still assumes undiscounted MDP's, so perhaps this constraint is lost or redefined for the more general case? I may be wrong, but I assumed the paper only focused on a finite-space, and avoided the more general solution. They do briefly state that with the proper generalizations of the required regularity conditions, their results are generalized to the infinte case.
|
| Dana Dahlstrom
|
12
|
 |
|
04-09-2002 01:09 PM ET (US)
|
|
Edited by author 04-09-2002 01:42 PM
I think I can address Joe's questions:
1. They claim potential-based shaping rewards preserve the entire
partial order of policies, not just those which are optimal or
near-optimal. When they write ``(near-)optimal'', I think they
intend both ``optimal'' and ``near-optimal'' simultaneously. (See
their Remark 1 for more detail.)
2. The discount rate quantifies the relative value of delayed
versus immediate rewards. They assume undiscounted MDPs have a
(``distinguished'') absorbing state and that discounted MDPs do
not have a (``corresponding'') absorbing state. This seems to me
the main reason for treating the two cases somewhat separately.
Following up to my original post and the responses to it:
I'm thinking here of discounted (not undiscounted) MDPs. What I
was getting at with self-transitions and small cycles isn't that
they could produce positive rewards, but that they could behave
like absorbing states in discounted MDPs which (I think) aren't
supposed to have any.
Moreover, there could be multiple such ``pseudo-absorbing
states'' with different potentials. As Bret points out, this
gives rise to sub-optimal policies (which would contradict Remark
2).
|
| Joe Drish
|
11
|
 |
|
04-09-2002 09:26 AM ET (US)
|
|
I have 2 questions about this paper:
1. They claim they have shown in the paper that potential-based shaping rewards of a particular form leave (near)-optimal policies unchanged. Why is the (near) qualification necessary? Why not just optimal?
2. What is the significance of the discount factor? It seemed like their results were highly dependent on whether or not the problem and/or shaping function was discounted.
|
| Bret Ehlert
|
10
|
 |
|
04-09-2002 08:31 AM ET (US)
|
|
Edited by author 04-09-2002 09:05 AM
In regards to Dana's comments on Remark 2. I don't think a self-transition is "legitimate" in the sense that it is nothing more than modeling the suspension of time. The running time of such a system would be astronomical :). Also, any cycle would result in a reward of zero.
I think the point of Remark 2 is quite powerful: If a reward function is potential-based then all paths through the state space that eventually terminate at the absorbing state have the same total reward. For me this has provided the intuition behind the proof of sufficiency.
Remark 2 definitely shows one limit of Ng, Harda, and Russell's Theorem, related to Dana's comments. Specifically, it doesn't hold up if there is more than one absorbing state, which is often the case in real reinforcement learning problems. For example, there are many ways to end a game of checkers. If I'm not mistaken?
|
| Bret Ehlert
|
9
|
 |
|
04-09-2002 07:03 AM ET (US)
|
|
Regarding the "simple algebraic manipulation" (comment 7 by Thomas)...
First, subtract phi(s) from both sides.
Second, the gamma * phi(s') cancels with the - phi(s') inside the gamma * max(.)
If I'm not mistaken?
|
| Thomas Z
|
8
|
 |
|
04-09-2002 04:49 AM ET (US)
|
|
In response to Dana's question:
As I understand it, if the optimal policy is defined as argmax_a Q_M(s,a)
and Q_M(s, a) is defined as E_s1_s2[R(s1,a,s2) + gamma*V] ..
s1_s2 indicates transition from s1 to s2.
and if R(s1,a,s2) = phi(s1) - phi(s2) then the expectation value of R(s1, a, s2) is zero, so any policy is optimal in this sense.
|
| Thomas Z
|
7
|
 |
|
04-09-2002 04:40 AM ET (US)
|
|
In the proof of sufficiency, how does the first equation become transformed to the 2nd equation?
Is it because the expectation value of gamma*phi(s_prime) - phi(s) = 0 ?
The "simple algebraic manipulation" is somehow not obvious to me...
|
| Dana Dahlstrom
|
6
|
 |
|
04-09-2002 03:03 AM ET (US)
|
|
Edited by author 04-09-2002 01:49 PM
Is anyone else dubious about Remark 2? It seems to me a
potential-based reward function ought to favor policies which
(are likely to) cause transitions to states with the highest
potentials; these should be preferable, at least, to policies
which cause transitions to low-potential states and then just
hang around waiting for the infinite horizon.
I realize this sounds like an appeal to an ``absorbing state'',
but isn't a self-transition (from state s to state s) legitimate?
And if not, what about a policy which results in small state
transition cycles? These scenarios would emulate absorbing
states without fitting the exact definition in the paper.
If this criticism of Remark 2 is valid, then I think we can
conclude potential-based shaping functions do not guarantee
policy invariance for MDPs in general, with a specific exception
being MDPs with potential-based reward functions.
Comments?
|
| Bianca Zadrozny
|
5
|
 |
|
04-08-2002 10:51 PM ET (US)
|
|
I just talked to Kristin and she told me that the grid "game" does not finish if the agent reaches the goal without going through the subgoals in the right order. So, the optimal policy for this MDP is to go to each subgoal in order and reaching the goal (with the shortest path possible). Adding the shaping function will just accelerate learning.
|
| Bianca Zadrozny
|
4
|
 |
|
04-08-2002 10:24 PM ET (US)
|
|
David, I think that the distinction between the original reward function and the shaping function is a bit more subtle than just environment-based vs. non-environment-based.
Since the reward function associated with an environment is not always natural or obvious, the designer has to create a reward function to make the agent learn a certain desired behavior. For example, in the grid world in the paper, they say that each step gets a negative reward of -1, in order to make the learner go quickier to the goal state. By choosing the reward function, the designer is in a sense choosing the policy that will be learned (i.e. faster paths to the goal state).
On the other hand, the shaping function should not influence the policy that will be learned, it just influences how quickly it will be learned. That's why they say you should use potential functions for shaping. They assume that the current reward function already captures the behavior that must be learned.
But given what I just said, I don't understand the example with subgoals. The original reward function does not capture our intention that the agent go through each subgoal. Since the subgoals are not related to the final goal, I don't understand how it helps at all to make the agent go through them.
|
Dave Kauchak
|
3
|
 |
|
04-08-2002 05:13 PM ET (US)
|
|
I found this paper to be quite well written. I particularly thought their introduction to notation and consistent use of notation throughout the paper made the concepts much easier to grasp.
That said, I do have a few questions/concerns. I found all of the forward references to the infinite state MDPs slightly distracting in the paper. I understand their motivation for putting these in, but I think a single in the introductory section would have been sufficient. Also, I still had trouble grasping how exactly to choose appropriate phis. I think it would have been helpful to see another problem domain besides the grid world.
Finally, I still find myself a bit confused about the difference between the normal reward function and the shaping reward function. It seems to me that the normal reward function is imposed by the environment and the shaping reward function is something that is added by the algorithm designers. However, this distinction doesn't seem totally clear to me.
|
| Bianca Zadrozny
|
2
|
 |
|
04-07-2002 08:47 PM ET (US)
|
|
sup stands for Supremum. From Mathworld ( http://mathworld.wolfram.com/Supremum.html): "The supremum is the least upper bound of a set S, defined as a quantity M such that no member of the set exceeds M, but if \epsilon is any positive quantity, however small, there is a member that exceeds M - \epsilon." For example, the supremum of S=(0,1) (open interval from 0 to 1) is 1. The supremum is commonly used in place of maximum when the set contains real numbers because the maximum may not exist (as is the case for the interval (0,1), since 1 does not belong to the set). So, you can think of V_M*(s)=sup_\pi V_M^\pi(s) as V_M*(s)=max_\pi V_M^\pi(s) (except that V* may never be reached exactly for any policy \pi although we can get arbitrarily close to it).
|
| Thomas Z
|
1
|
 |
|
04-05-2002 04:03 PM ET (US)
|
|
Could someone tell me what the symbol sup_pi in the paper mean? Thanks
|