Degui Zhi | 23 | 04-10-2002 06:09 PM ET (US)
Also, I was trying to understand the potential with discounting. Consider a cycle s1->s2->s1:
  shaping reward for s1->s2 is gamma*phi(s2) - phi(s1)
  shaping reward for s2->s1 is gamma*phi(s1) - phi(s2)
Adding these up, the total shaping reward is -(1-gamma)(phi(s1)+phi(s2)).
Then the question is: is phi(.) > 0 or phi(.) < 0? However, I think the learning (of the optimal policy) should be invariant with respect to the sign of the phi function...
I assume the author has just relaxed the definition of a discounted potential-based function and assumes phi(.) > 0, so that the total shaping reward around a cycle is always negative. (However, in their experiment they used a negative potential, which is confusing to me.)
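
For anyone who wants to check the cycle arithmetic numerically, here is a minimal Python sketch. The function names, the value gamma = 0.9, and the potential values are all made up for illustration and are not taken from the paper; the total is the plain (undiscounted) sum of the two shaping rewards, as in the calculation above.

```python
def shaping_reward(phi, s, s_next, gamma):
    """Potential-based shaping reward: gamma*phi(s') - phi(s)."""
    return gamma * phi[s_next] - phi[s]


def cycle_shaping_total(phi, gamma):
    """Plain sum of the shaping rewards around the cycle s1 -> s2 -> s1."""
    return (shaping_reward(phi, "s1", "s2", gamma)
            + shaping_reward(phi, "s2", "s1", gamma))


gamma = 0.9  # assumed discount factor, for illustration only

# Hypothetical potentials: one strictly positive, one its negation.
positive_phi = {"s1": 1.0, "s2": 2.0}
negative_phi = {"s1": -1.0, "s2": -2.0}

total_pos = cycle_shaping_total(positive_phi, gamma)
total_neg = cycle_shaping_total(negative_phi, gamma)

# Closed form from the post: -(1 - gamma) * (phi(s1) + phi(s2))
closed_form_pos = -(1 - gamma) * (positive_phi["s1"] + positive_phi["s2"])

print(f"positive phi: {total_pos:.3f}")  # -0.300
print(f"negative phi: {total_neg:.3f}")  # 0.300
```

With these made-up numbers, the positive potential gives a net penalty around the cycle and the negated potential gives a net bonus of the same size, which is exactly the sign dependence I am asking about.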