QuickTopic (SM) free message boards QuickTopic (SM) free message boards
Skip to Messages
  Sign In to access your topic list  |New Topic |My Topics|Profile
Upgrade to Pro   Customize, show pictures, add an intro, and more:   QuickTopic Pro...and check out QuickThreadSM
Topic: Policy invariance under reward transformations: ... reward shaping
Views: 343, Unique: 189 
Subscribers: 0
What's
this?
Printer-Friendly Page
Subscribe to get & post, or stop messages by email Subscribe
All messages    << 16-31  1-15 of 31        
About these ads
Who | When
Messagessort recent-top   
Post a new message
 
Thomas Z  1
04-05-2002 04:03 PM ET (US)
Could someone tell me what the symbol sup_pi in the paper mean? Thanks
Bianca Zadrozny  2
04-07-2002 08:47 PM ET (US)
sup stands for Supremum.

From Mathworld (http://mathworld.wolfram.com/Supremum.html):

"The supremum is the least upper bound of a set S, defined as a quantity M such that no member of the set exceeds M, but if \epsilon is any positive quantity, however small, there is a member that exceeds M - \epsilon."

For example, the supremum of S=(0,1) (open interval from 0 to 1) is 1.

The supremum is commonly used in place of maximum when the set contains real numbers because the maximum may not exist (as is the case for the interval (0,1), since 1 does not belong to the set).

So, you can think of V_M*(s)=sup_\pi V_M^\pi(s) as
 V_M*(s)=max_\pi V_M^\pi(s) (except that V* may never be reached exactly for any policy \pi although we can get arbitrarily close to it).
Dave KauchakPerson was signed in when posted  3
04-08-2002 05:13 PM ET (US)
I found this paper to be quite well written. I particularly thought their introduction to notation and consistent use of notation throughout the paper made the concepts much easier to grasp.

That said, I do have a few questions/concerns. I found all of the forward references to the infinite state MDPs slightly distracting in the paper. I understand their motivation for putting these in, but I think a single in the introductory section would have been sufficient. Also, I still had trouble grasping how exactly to choose appropriate phis. I think it would have been helpful to see another problem domain besides the grid world.

Finally, I still find myself a bit confused about the difference between the normal reward function and the shaping reward function. It seems to me that the normal reward function is imposed by the environment and the shaping reward function is something that is added by the algorithm designers. However, this distinction doesn't seem totally clear to me.
Bianca Zadrozny  4
04-08-2002 10:24 PM ET (US)
David, I think that the distinction between the original reward function and the shaping function is a bit more subtle than just environment-based vs. non-environment-based.

Since the reward function associated with an environment is not always natural or obvious, the designer has to create a reward function to make the agent learn a certain desired behavior. For example, in the grid world in the paper, they say that each step gets a negative reward of -1, in order to make the learner go quickier to the goal state. By choosing the reward function, the designer is in a sense choosing the policy that will be learned (i.e. faster paths to the goal state).

On the other hand, the shaping function should not influence the policy that will be learned, it just influences how quickly it will be learned. That's why they say you should use potential functions for shaping. They assume that the current reward function already captures the behavior that must be learned.

But given what I just said, I don't understand the example with subgoals. The original reward function does not capture our intention that the agent go through each subgoal. Since the subgoals are not related to the final goal, I don't understand how it helps at all to make the agent go through them.
Bianca Zadrozny  5
04-08-2002 10:51 PM ET (US)
I just talked to Kristin and she told me that the grid "game" does not finish if the agent reaches the goal without going through the subgoals in the right order. So, the optimal policy for this MDP is to go to each subgoal in order and reaching the goal (with the shortest path possible). Adding the shaping function will just accelerate learning.
Dana Dahlstrom  6
04-09-2002 03:03 AM ET (US)
Edited by author 04-09-2002 01:49 PM
Is anyone  else  dubious  about  Remark  2?  It  seems  to  me  a
potential-based  reward  function  ought  to favor policies which
(are likely to) cause transitions  to  states  with  the  highest
potentials;  these  should  be  preferable, at least, to policies
which cause transitions to low-potential  states  and  then  just
hang around waiting for the infinite horizon.

I realize this sounds like an appeal to an  ``absorbing  state'',
but isn't a self-transition (from state s to state s) legitimate?
And if not, what about a policy  which  results  in  small  state
transition  cycles?   These  scenarios  would  emulate  absorbing
states without fitting the exact definition in the paper.

If this criticism of Remark 2 is  valid,  then  I  think  we  can
conclude  potential-based  shaping  functions  do  not  guarantee
policy invariance for MDPs in general, with a specific  exception
being MDPs with potential-based reward functions.

Comments?
Thomas Z  7
04-09-2002 04:40 AM ET (US)
In the proof of sufficiency, how does the first equation become transformed to the 2nd equation?

Is it because the expectation value of
gamma*phi(s_prime) - phi(s) = 0 ?

The "simple algebraic manipulation" is somehow not obvious to me...
Thomas Z  8
04-09-2002 04:49 AM ET (US)
In response to Dana's question:

As I understand it, if the optimal policy is defined as
argmax_a Q_M(s,a)

and Q_M(s, a) is defined as
 E_s1_s2[R(s1,a,s2) + gamma*V] ..

s1_s2 indicates transition from s1 to s2.

and if R(s1,a,s2) = phi(s1) - phi(s2)
then the expectation value of R(s1, a, s2) is zero, so any policy is optimal in this sense.
Bret Ehlert  9
04-09-2002 07:03 AM ET (US)
Regarding the "simple algebraic manipulation" (comment 7 by Thomas)...

First, subtract phi(s) from both sides.

Second, the gamma * phi(s') cancels with the - phi(s') inside the gamma * max(.)

If I'm not mistaken?
Bret Ehlert  10
04-09-2002 08:31 AM ET (US)
Edited by author 04-09-2002 09:05 AM
In regards to Dana's comments on Remark 2. I don't think a self-transition is "legitimate" in the sense that it is nothing more than modeling the suspension of time. The running time of such a system would be astronomical :). Also, any cycle would result in a reward of zero.

I think the point of Remark 2 is quite powerful: If a reward function is potential-based then all paths through the state space that eventually terminate at the absorbing state have the same total reward. For me this has provided the intuition behind the proof of sufficiency.

Remark 2 definitely shows one limit of Ng, Harda, and Russell's Theorem, related to Dana's comments. Specifically, it doesn't hold up if there is more than one absorbing state, which is often the case in real reinforcement learning problems. For example, there are many ways to end a game of checkers. If I'm not mistaken?
Joe Drish  11
04-09-2002 09:26 AM ET (US)
I have 2 questions about this paper:

1. They claim they have shown in the paper that potential-based shaping rewards of a particular form leave (near)-optimal policies unchanged. Why is the (near) qualification necessary? Why not just optimal?

2. What is the significance of the discount factor? It seemed like their results were highly dependent on whether or not the problem and/or shaping function was discounted.
Dana Dahlstrom  12
04-09-2002 01:09 PM ET (US)
Edited by author 04-09-2002 01:42 PM
I think I can address Joe's questions:

1. They claim potential-based shaping rewards preserve the entire
partial  order  of  policies, not just those which are optimal or
near-optimal. When they write ``(near-)optimal'',  I  think  they
intend both ``optimal'' and ``near-optimal'' simultaneously. (See
their Remark 1 for more detail.)

2. The discount rate quantifies the  relative  value  of  delayed
versus  immediate  rewards.  They assume undiscounted MDPs have a
(``distinguished'') absorbing state and that discounted  MDPs  do
not  have a (``corresponding'') absorbing state. This seems to me
the main reason for treating the two cases somewhat separately.

Following up to my original post and the responses to it:

I'm thinking here of discounted (not undiscounted) MDPs.  What  I
was  getting at with self-transitions and small cycles isn't that
they could produce positive rewards, but that they  could  behave
like  absorbing  states in discounted MDPs which (I think) aren't
supposed to have any.

Moreover,  there  could  be  multiple   such   ``pseudo-absorbing
states''  with  different  potentials.  As  Bret points out, this
gives rise to sub-optimal policies (which would contradict Remark
2).
Eugene Ke  13
04-09-2002 02:51 PM ET (US)
Dana,

Don't the authors address your concern in section 2.1, last paragraph? They state for MDP's of infinite state spaces (discounted), you have to address the fact that their definition of Vm(S) may not exist. Even after that, you still have to consider what you mean by optimal polices.

I think Remark 2 still assumes undiscounted MDP's, so perhaps this constraint is lost or redefined for the more general case? I may be wrong, but I assumed the paper only focused on a finite-space, and avoided the more general solution. They do briefly state that with the proper generalizations of the required regularity conditions, their results are generalized to the infinte case.
Dana Dahlstrom  14
04-09-2002 03:18 PM ET (US)
Eugene:

I'm thinking of MDPs with finite state spaces, not  infinite.  As
for  what  I  mean  by optimal policy, the (second) definition in
section 2.1 paragraph 4 suits me fine.

I don't think Remark 2 assumes  undiscounted  MDPs:  witness  the
gamma  in the equation. In any case, I think Remark 2 needs to be
true for discounted MDPs if their policy invariance claim  is  to
hold in general.

At risk of posting too voluminously, I feel like playing  devil's
advocate:  what  good  is shaping if it requires a good distance-
based heuristic? If we have one of those, can't we skip  learning
and use it in a straightforward search algorithm, or determine an
optimal  policy  analytically?  Worse,  if   our   distance-based
heuristic is not good, won't shaping actually retard learning?
Eric Wiewiora  15
04-09-2002 03:25 PM ET (US)
I'm wondering how the second example's shaping is an example of a potential function. It looks like the potential for a state is time-dependent, so any subsequent state will have an even lower potential than the previous one, even if the agent is just moving back and forth. Am I mistaken in believing that moving in a cycle should always recieve a total F of zero?
RSS link What's this?
All messages    << 16-31  1-15 of 31        
QuickTopicSM message boards
Over 200,000 topics served
Learn more Frequently asked questions  Acknowledgements
What they're saying about QuickTopic
 Questions, comments, or suggestions? Contact Us
Read our use policy before beginning. We value your privacy; please read our privacy statement.
Copyright ©1999-2008 Internicity Inc. All rights reserved.