QuickTopic (SM) free message boards
Topic: CSE 151 in Fall 2008
Charles Elkan  136
12-08-2008 12:08 PM ET (US)
Final exam Thursday this week

The final exam is scheduled from 8am to 11am on Thursday this week. If there is a consensus, I would be happy to make it shorter and start later, say starting at 8:45am. But there would have to be a consensus. Any opinions?
Meir Schwarz  135
12-08-2008 02:02 AM ET (US)
/m133: What if we stop when the policy doesn't change and the change in Q is smaller than some tunable threshold, which we find by experimenting?
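A minimal sketch of that stopping test, assuming Q is stored as a dict of dicts (state -> action -> value). The names `greedy_policy` and `should_stop` and the toy numbers are made up for illustration. Passing this test does not prove that the policy will never change in a later trial; it is only a heuristic.

```python
def greedy_policy(Q):
    """Map each state to its highest-valued action under Q."""
    return {s: max(actions, key=actions.get) for s, actions in Q.items()}


def should_stop(Q_before, Q_after, threshold):
    """True if the greedy policy is unchanged and no Q value moved by threshold or more."""
    same_policy = greedy_policy(Q_before) == greedy_policy(Q_after)
    max_delta = max(
        abs(Q_after[s][a] - Q_before[s][a])
        for s in Q_after
        for a in Q_after[s]
    )
    return same_policy and max_delta < threshold


# Toy Q tables (made-up numbers) before and after one trial.
q_old = {"s0": {"up": 0.50, "left": 0.40}, "s1": {"up": 0.10, "left": 0.30}}
q_new = {"s0": {"up": 0.505, "left": 0.40}, "s1": {"up": 0.10, "left": 0.302}}

print(should_stop(q_old, q_new, threshold=0.01))
```

The threshold is the tunable value; a smaller threshold means more trials before stopping.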
Charles Elkan  134
12-08-2008 01:36 AM ET (US)
/m130: Remember, a policy doesn't choose the next state. Instead, it chooses an action and then the next state is random, but influenced by the action. Slide 8 says "maximize the expected utility of the immediate successors." So, the optimal action is not necessarily in the direction of the highest-value state.

Different gamma values certainly lead to different value functions and different optimal policies. Slide 6 shows different policies caused by different penalties. You could draw a similar slide based on different discount factors.
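A small sketch of why the best action need not point at the highest-value neighbor. The neighbor utilities .655, .66 and .38 are the ones quoted in /m130. The state's own utility (0.611), the wall below the state, and the 0.8/0.1/0.1 slip model are assumptions made for illustration.

```python
# Utility of the current state (assumed value, not taken from the thread).
U_HERE = 0.611

# Utility reached by actually moving in each direction.
# "down" is assumed to hit a wall, so the agent stays where it is.
NEIGHBOR = {"left": 0.655, "up": 0.66, "right": 0.38, "down": U_HERE}

# Perpendicular directions the agent may slip into, each with probability 0.1.
SLIPS = {
    "left": ("up", "down"),
    "right": ("up", "down"),
    "up": ("left", "right"),
    "down": ("left", "right"),
}


def expected_utility(action):
    """Expected utility of the immediate successors for one action."""
    slip_a, slip_b = SLIPS[action]
    return 0.8 * NEIGHBOR[action] + 0.1 * NEIGHBOR[slip_a] + 0.1 * NEIGHBOR[slip_b]


scores = {action: expected_utility(action) for action in SLIPS}
best = max(scores, key=scores.get)
print(scores)
print("best action:", best)
```

Under these assumptions "left" scores about 0.651 and "up" about 0.632, because aiming up risks slipping right into the low-value state.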
Charles Elkan  133
12-08-2008 01:24 AM ET (US)
/m131, /m126: I mean, how can you be sure that continuing Q-learning for more trials will not lead to any further change in the policy? A trial is one "lifetime" for the agent, from the initial state until it reaches one of the two terminal states.
Charles Elkan  132
12-08-2008 01:21 AM ET (US)
/m129, /m127: Tobias, thank you for the correct answer.
Meir Schwarz  131
12-08-2008 12:35 AM ET (US)
But you need to ask the question, how can you tell for sure that pi will never change any more?
Do you mean like make sure PI doesn't change for an entire game?
matt  130
12-07-2008 11:42 PM ET (US)
Edited by author 12-08-2008 12:31 AM
new question:

ok, on slide 8, there is something wrong. If the policy is maximizing the utility, then the picture of the policy does not match the values given for the policy from (1,3). The utility for going left is .655, the utility for going up is .66, and the utility for going right is .38. The policy is supposed to choose the max utility, but the picture has the policy going left, while the max utility would be going north. So either the picture of the given optimal policy is wrong, or the actual values on slide 8 for positions (1,2) and (2,3) are swapped. Please clarify.

Also, are we assuming that gamma is 1? Gamma does affect the values you get when running policy iteration.
Tobias  129
12-07-2008 11:16 PM ET (US)
@matt: It is not actually possible to move left or down from (1,1). Those cases are there just for completeness, which can make computations more convenient since you need no exceptions for the goal/trap cells.
So for (1,1): if you go left, the robot bumps into the wall. If it slips to its left (and therefore goes down), it bumps into the wall too. In both cases the robot remains in its position; that's the 0.9U(1,1). A slip to its right results in going upwards, hence the 0.1U(1,2). The same reasoning applies to the 'down' part of the equation.
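The rule described above can be sketched as a small transition function. The 4x3 grid size, the blocked cell at (2,2), and the names `step` and `transitions` are assumptions for illustration; coordinates are (column, row) with (1,1) at the bottom left, as in the slide 10 example.

```python
WIDTH, HEIGHT = 4, 3   # assumed grid size
BLOCKED = {(2, 2)}     # assumed interior obstacle

MOVES = {"up": (0, 1), "down": (0, -1), "left": (-1, 0), "right": (1, 0)}

# Perpendicular directions the robot may slip into, each with probability 0.1.
SLIPS = {
    "up": ("left", "right"),
    "down": ("left", "right"),
    "left": ("up", "down"),
    "right": ("up", "down"),
}


def step(state, direction):
    """Deterministic move; the robot stays put if it would hit a wall or obstacle."""
    x, y = state
    dx, dy = MOVES[direction]
    nxt = (x + dx, y + dy)
    inside = 1 <= nxt[0] <= WIDTH and 1 <= nxt[1] <= HEIGHT
    return nxt if inside and nxt not in BLOCKED else state


def transitions(state, action):
    """Return {next_state: probability} for taking `action` in `state`."""
    dist = {}
    outcomes = [(action, 0.8)] + [(d, 0.1) for d in SLIPS[action]]
    for direction, prob in outcomes:
        nxt = step(state, direction)
        dist[nxt] = dist.get(nxt, 0.0) + prob
    return dist


for action in MOVES:
    print(action, transitions((1, 1), action))
```

For "left" from (1,1) this gives probability 0.9 of staying in (1,1) and 0.1 of reaching (1,2), with no special case needed for the corner.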
Mike Rose  128
12-07-2008 09:46 PM ET (US)
Deleted by author 12-07-2008 10:59 PM
matt  127
12-07-2008 09:40 PM ET (US)
ok, maybe we are misunderstanding the algorithm. So, on slide 10 of the notes there is this algorithm
U(s) = R(s) + γ max_a Σ_s′ U(s′) T(s, a, s′) (see slides for proper format)

with example
U(1, 1) = −0.04 + γ max{ 0.8U(1, 2) + 0.1U(2, 1) + 0.1U(1, 1),   (up)
                         0.9U(1, 1) + 0.1U(1, 2),                (left)
                         0.9U(1, 1) + 0.1U(2, 1),                (down)
                         0.8U(2, 1) + 0.1U(1, 2) + 0.1U(1, 1) }  (right)

my question is, if you are at spot (1,1), how is it possible to go in all 4 directions? As far as I know there is no wrap-around, so why are all directions possible?
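For reference, the slide 10 update for (1,1) can be written out as a short sketch. The four expressions are copied from the example above; the utility values, gamma = 1, and the function name `backup_1_1` are assumptions chosen only to make the example run.

```python
def backup_1_1(U, gamma=1.0, reward=-0.04):
    """One Bellman update for state (1,1), using the four cases from the slide 10 example."""
    candidates = {
        "up": 0.8 * U[(1, 2)] + 0.1 * U[(2, 1)] + 0.1 * U[(1, 1)],
        "left": 0.9 * U[(1, 1)] + 0.1 * U[(1, 2)],
        "down": 0.9 * U[(1, 1)] + 0.1 * U[(2, 1)],
        "right": 0.8 * U[(2, 1)] + 0.1 * U[(1, 2)] + 0.1 * U[(1, 1)],
    }
    best_action = max(candidates, key=candidates.get)
    return reward + gamma * candidates[best_action], best_action


# Assumed current utility estimates for (1,1) and its two neighbors.
U = {(1, 1): 0.705, (1, 2): 0.762, (2, 1): 0.655}

new_value, action = backup_1_1(U)
print(new_value, action)
```

With these assumed values the maximum is the "up" case. The "left" and "down" cases are still evaluated, but they mostly leave the agent in (1,1).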
Charles Elkan  126
12-07-2008 09:39 PM ET (US)
Are we allowed to change epsilon or the constant in softmax as learning progresses?
Sure. It would be interesting to do experiments to find the best schedule for these values, similar to the schedule for T in deterministic annealing.

When do we stop iterating for Q learning? Is it when PI stops changing like policy iteration (doesn't seem right to me) or is it when Q stops changing?
Since the real goal is always to learn an optimal policy, stopping when the policy pi stops changing seems sensible. But you need to ask the question, how can you tell for sure that pi will never change any more?
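One possible shape for such a schedule, as a rough sketch. The exponential decay, the default values, and the names `epsilon_at` and `softmax_probs` are assumptions for illustration, and the softmax "constant" is treated here as a temperature.

```python
import math


def epsilon_at(trial, start=1.0, end=0.05, decay=0.99):
    """Exponentially decay epsilon from `start` toward a floor of `end`."""
    return max(end, start * decay ** trial)


def softmax_probs(q_values, temperature):
    """Boltzmann action probabilities; a lower temperature gives greedier choices."""
    highest = max(q_values)  # subtract the max for numerical stability
    weights = [math.exp((q - highest) / temperature) for q in q_values]
    total = sum(weights)
    return [w / total for w in weights]


for trial in (0, 100, 1000):
    print(trial, epsilon_at(trial))

print(softmax_probs([1.0, 0.0], temperature=10.0))
print(softmax_probs([1.0, 0.0], temperature=0.1))
```

Early trials explore almost uniformly, and later trials mostly follow the current Q values. The decay rate and the floor are exactly the kind of values one would tune by experiment.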
Meir Schwarz  125
12-07-2008 09:16 PM ET (US)
Another question. When do we stop iterating for Q learning? Is it when PI stops changing like policy iteration (doesn't seem right to me) or is it when Q stops changing?
Meir Schwarz  124
12-07-2008 08:50 PM ET (US)
Are we allowed to change epsilon or the constant in softmax as learning progresses?
Charles Elkan  123
12-07-2008 06:46 PM ET (US)
How is it possible for our algorithm to work to find the optimal policy, but not be able to find the alternative optimal policies (it basically finds the opposite of those policies when values in range are entered)?

It does sound like you have one (or more!) bugs. For debugging suggestions see /m122.
Charles Elkan  122
12-07-2008 06:45 PM ET (US)
Our Policy Iteration algorithm produces the correct policy but the values seem to be off by about .05. Is this acceptable or does it hint that we have a slight error in our algorithm? If not, any idea where we should start debugging?

Many different value functions can lead to the same policy, so this is a hint that there is an error somewhere. The error may be in your code, but it may also be in your understanding of the gridworld domain. For example, the exact definition of when an action has an unintended effect (e.g. moving left on an "up" action) is not clear.

For debugging, it may be easiest to start with the goal state (upper right) and figure out whether its learned value is correct. Then do the same for states next to the goal state, then one away, and so on.
Charles Elkan  121
12-07-2008 06:40 PM ET (US)
Will there be a review session for the final?
In the section tomorrow (Monday) you can ask any and all questions relevant to the final. Or you can ask questions here.

If anyone feels an additional in-person review session would be useful, please email me personally.

Remember, a review session is only useful if you come prepared with specific questions.
Copyright ©1999-2008 Internicity Inc. All rights reserved.