| Who | When |
Messages | |
(not accepting new messages)
|
|
| Charles Elkan
|
136
|
 |
|
12-08-2008 12:08 PM ET (US)
|
|
Final exam Thursday this week
The final exam is scheduled from 8am to 11am on Thursday this week. If there is a consensus, I would be happy to make it shorter and start later, say starting at 8:45am. But there would have to be a consensus. Any opinions?
|
| Meir Schwarz
|
135
|
 |
|
12-08-2008 02:02 AM ET (US)
|
|
/m133: What if we stop when the policy doesn't change and the change in Q is smaller than some tunable variable which we find by experimenting?
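The stopping rule proposed here can be sketched as follows. This is only an illustration: the layout of the Q table (a dict of dicts), the function names, and the default threshold `tol` are assumptions, and passing this check does not guarantee that the policy will never change again.

```python
def greedy_policy(Q):
    """Map each state to its highest-valued action; Q is {state: {action: value}}."""
    return {s: max(actions, key=actions.get) for s, actions in Q.items()}

def max_q_change(Q_old, Q_new):
    """Largest absolute change of any Q entry between two snapshots."""
    return max(abs(Q_new[s][a] - Q_old[s][a]) for s in Q_old for a in Q_old[s])

def should_stop(Q_old, Q_new, tol=1e-3):
    """True if the greedy policy is unchanged AND every Q entry moved by less than tol."""
    same_policy = greedy_policy(Q_old) == greedy_policy(Q_new)
    return same_policy and max_q_change(Q_old, Q_new) < tol

# Hypothetical snapshots of Q taken before and after a batch of trials.
Q_before = {"s1": {"up": 0.50, "left": 0.20}, "s2": {"up": 0.10, "left": 0.90}}
Q_small = {"s1": {"up": 0.5001, "left": 0.20}, "s2": {"up": 0.10, "left": 0.9001}}
Q_flip = {"s1": {"up": 0.20, "left": 0.50}, "s2": {"up": 0.10, "left": 0.90}}

print(should_stop(Q_before, Q_small))  # policy same, tiny change
print(should_stop(Q_before, Q_flip))   # policy changed in s1
```

The threshold `tol` is the "tunable variable" from the question; a sensible value would have to be found by experiment.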
|
| Charles Elkan
|
134
|
 |
|
12-08-2008 01:36 AM ET (US)
|
|
/m130: Remember, a policy doesn't choose the next state. Instead, it chooses an action and then the next state is random, but influenced by the action. Slide 8 says "maximize the expected utility of the immediate successors." So, the optimal action is not necessarily in the direction of the highest-value state. Different gamma values certainly lead to different value functions and different optimal policies. Slide 6 shows different policies caused by different penalties. You could draw a similar slide based on different discount factors.
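A small numeric sketch of this point, using the three neighbouring values quoted in /m130 and the 0.8 / 0.1 / 0.1 action model from the slide-10 example. Two things are assumptions for illustration only: the value 0.611 for the current cell, and that the cell sits on the bottom edge so that moving down leaves the agent where it is.

```python
# Values of the neighbouring cells as quoted in /m130.
U = {"left": 0.655, "up": 0.66, "right": 0.38}
# Assumption for illustration only: value of the current cell (not given in the thread).
U["here"] = 0.611

# Assumption: the cell is on the bottom edge, so "down" bumps the wall and the agent stays.
lands_in = {"left": "left", "up": "up", "right": "right", "down": "here"}

# An action goes the intended way with probability 0.8 and slips to each
# perpendicular direction with probability 0.1.
sideways = {"left": ("up", "down"), "right": ("up", "down"),
            "up": ("left", "right"), "down": ("left", "right")}

def expected_utility(action):
    """Expected value of the successor state when `action` is chosen."""
    return 0.8 * U[lands_in[action]] + sum(0.1 * U[lands_in[d]] for d in sideways[action])

for action in sideways:
    print(action, round(expected_utility(action), 4))

best_action = max(sideways, key=expected_utility)
best_neighbor = max(("left", "up", "right"), key=U.get)
print(best_action, best_neighbor)  # prints: left up
```

With these numbers, going up risks slipping into the 0.38 cell, so its expected utility (about 0.63) is lower than that of going left (about 0.65), even though the cell above has the highest value. Left beats up whenever the assumed value of the current cell exceeds about 0.415.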
|
| Charles Elkan
|
133
|
 |
|
12-08-2008 01:24 AM ET (US)
|
|
/m131, /m126: I mean, how can you be sure that continuing Q-learning for more trials will not lead to any further change in the policy? A trial is one "lifetime" for the agent, from the initial state until one of the two terminal states.
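To make "trial" concrete, here is a minimal sketch of Q-learning run trial by trial. The environment is a hypothetical five-cell corridor with a terminal state at each end, not the gridworld from the assignment, and the learning rate, discount, epsilon, and slip probability are all assumed values. The counter at the end only reports how long the greedy policy has been stable; it cannot prove that the policy will stay unchanged.

```python
import random

TERMINALS = {0: -1.0, 4: +1.0}   # hypothetical corridor: states 0..4, both ends terminal
ACTIONS = (-1, +1)               # step left or step right
STEP_REWARD = -0.04

def step(state, action, rng):
    """Move as intended with probability 0.8, otherwise move the opposite way."""
    move = action if rng.random() < 0.8 else -action
    nxt = state + move
    return nxt, TERMINALS.get(nxt, STEP_REWARD)

def run_trial(Q, rng, alpha=0.1, gamma=1.0, epsilon=0.1, start=2):
    """One trial: act from `start` until a terminal state, updating Q in place."""
    s = start
    while s not in TERMINALS:
        if rng.random() < epsilon:
            a = rng.choice(ACTIONS)                           # explore
        else:
            a = max(ACTIONS, key=lambda act: Q[(s, act)])     # exploit
        s2, r = step(s, a, rng)
        future = 0.0 if s2 in TERMINALS else max(Q[(s2, b)] for b in ACTIONS)
        Q[(s, a)] += alpha * (r + gamma * future - Q[(s, a)])  # Q-learning update
        s = s2
    return s

def greedy(Q):
    """Greedy action for every non-terminal state."""
    return {s: max(ACTIONS, key=lambda act: Q[(s, act)]) for s in (1, 2, 3)}

rng = random.Random(0)
Q = {(s, a): 0.0 for s in (1, 2, 3) for a in ACTIONS}
unchanged = 0
for trial in range(500):
    before = greedy(Q)
    run_trial(Q, rng)
    unchanged = unchanged + 1 if greedy(Q) == before else 0
print("trials since the greedy policy last changed:", unchanged)
```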
|
| Charles Elkan
|
132
|
 |
|
12-08-2008 01:21 AM ET (US)
|
|
/m129, /m127: Tobias, thank you for the correct answer.
|
| Meir Schwarz
|
131
|
 |
|
12-08-2008 12:35 AM ET (US)
|
|
"But you need to ask the question, how can you tell for sure that pi will never change any more?" Do you mean something like making sure pi doesn't change for an entire game?
|
| matt
|
130
|
 |
|
12-07-2008 11:42 PM ET (US)
|
|
Edited by author 12-08-2008 12:31 AM
new question:
OK, on slide 8, something looks wrong. The picture of the policy does not match the values given, if the policy is supposed to maximize utility. Take the policy from (1,3): the utility for going left is .655, the utility for going up is .66, and the utility for going right is .38. The policy is supposed to choose the max utility, which would be going north, but the picture has the policy going left. So either the picture of the given optimal policy is wrong, or the actual values on slide 8 for positions (1,2) and (2,3) are swapped. Please clarify.
Also, are we assuming that gamma is 1? I ask because gamma does affect the values you get when running policy iteration.
|
| Tobias
|
129
|
 |
|
12-07-2008 11:16 PM ET (US)
|
|
@matt: It is not possible to go left or down. Those terms are just there for completeness, which can make computations more convenient since you need no exceptions for the goal/trap cells. So for (1,1): if you go left, the robot bumps into the wall. If it slides to the left (and therefore goes down), it bumps into the wall too. In both of these cases the robot remains in its position; that's the 0.9U(1,1). The slide to the right side would result in going upwards, hence the 0.1U(1,2). The same applies for the 'down' part of the equation.
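The bump-and-stay rule described above can be sketched as a transition function. The 4x3 grid size and the blocked cell at (2,2) are assumptions (the usual textbook layout); coordinates are (column, row), matching the slide-10 example where (1,2) is the cell above (1,1).

```python
COLS, ROWS = 4, 3
WALL = (2, 2)   # assumed blocked cell
MOVES = {"up": (0, 1), "down": (0, -1), "left": (-1, 0), "right": (1, 0)}
SLIPS = {"up": ("left", "right"), "down": ("left", "right"),
         "left": ("up", "down"), "right": ("up", "down")}

def move(state, direction):
    """Deterministic move; bumping into the edge or the wall leaves the state unchanged."""
    x, y = state
    dx, dy = MOVES[direction]
    nxt = (x + dx, y + dy)
    inside = 1 <= nxt[0] <= COLS and 1 <= nxt[1] <= ROWS
    return nxt if inside and nxt != WALL else state

def transitions(state, action):
    """Return {next_state: probability}: 0.8 intended direction, 0.1 for each side slip."""
    probs = {}
    for direction, p in ((action, 0.8), (SLIPS[action][0], 0.1), (SLIPS[action][1], 0.1)):
        nxt = move(state, direction)
        probs[nxt] = probs.get(nxt, 0.0) + p
    return probs

# All four actions are defined at the corner; some just leave the robot in place.
for action in MOVES:
    print(action, transitions((1, 1), action))
```

For "left" at (1,1), the intended move and the downward slip both end in (1,1), which is where the combined 0.9 weight on U(1,1) comes from.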
|
| Mike Rose
|
128
|
 |
|
12-07-2008 09:46 PM ET (US)
|
|
Deleted by author 12-07-2008 10:59 PM
|
| matt
|
127
|
 |
|
12-07-2008 09:40 PM ET (US)
|
|
OK, maybe we are misunderstanding the algorithm. On slide 10 of the notes there is this equation: $U(s) = R(s) + \gamma \max_a \sum_{s'} U(s')\,T(s, a, s')$ (see slides for proper format)
with the example $U(1,1) = -0.04 + \gamma \max\{\, 0.8U(1,2) + 0.1U(2,1) + 0.1U(1,1) \text{ (up)},\; 0.9U(1,1) + 0.1U(1,2) \text{ (left)},\; 0.9U(1,1) + 0.1U(2,1) \text{ (down)},\; 0.8U(2,1) + 0.1U(1,2) + 0.1U(1,1) \text{ (right)} \,\}$
My question is: if you are at spot (1,1), how is it possible to go in all 4 directions? As far as I know there is no wrap-around, so why are all directions possible?
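For reference, the update above can be iterated over every cell until the values settle. This is a sketch under stated assumptions, not the course's reference code: a 4x3 grid with a blocked cell at (2,2), terminal cells (4,3) = +1 and (4,2) = -1, step reward -0.04, gamma = 1, and moves into the edge or the blocked cell leaving the agent in place.

```python
COLS, ROWS = 4, 3
WALL = (2, 2)                                # assumed blocked cell
TERMINALS = {(4, 3): 1.0, (4, 2): -1.0}      # assumed goal and trap cells
MOVES = {"up": (0, 1), "down": (0, -1), "left": (-1, 0), "right": (1, 0)}
SLIPS = {"up": ("left", "right"), "down": ("left", "right"),
         "left": ("up", "down"), "right": ("up", "down")}

def move(state, direction):
    """Deterministic move; bumping into the edge or the wall leaves the state unchanged."""
    dx, dy = MOVES[direction]
    nxt = (state[0] + dx, state[1] + dy)
    inside = 1 <= nxt[0] <= COLS and 1 <= nxt[1] <= ROWS
    return nxt if inside and nxt != WALL else state

def value_iteration(gamma=1.0, reward=-0.04, tol=1e-6, max_sweeps=10_000):
    """Repeat U(s) = R(s) + gamma * max_a sum_s' T(s,a,s') U(s') until changes fall below tol."""
    states = [(x, y) for x in range(1, COLS + 1) for y in range(1, ROWS + 1) if (x, y) != WALL]
    U = {s: 0.0 for s in states}
    for _ in range(max_sweeps):
        new_U, delta = {}, 0.0
        for s in states:
            if s in TERMINALS:
                new_U[s] = TERMINALS[s]      # terminal values are fixed
            else:
                best = max(0.8 * U[move(s, a)] + sum(0.1 * U[move(s, d)] for d in SLIPS[a])
                           for a in MOVES)
                new_U[s] = reward + gamma * best
            delta = max(delta, abs(new_U[s] - U[s]))
        U = new_U
        if delta < tol:
            break
    return U

U = value_iteration()
for y in range(ROWS, 0, -1):                 # print the top row first
    print(" ".join(f"{U[(x, y)]:6.3f}" if (x, y) in U else "  wall" for x in range(1, COLS + 1)))
```

Calling `value_iteration(gamma=0.9)` shows how a different discount factor changes the values.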
|
| Charles Elkan
|
126
|
 |
|
12-07-2008 09:39 PM ET (US)
|
|
Are we allowed to change epsilon or the constant in softmax as learning progresses? Sure. It would be interesting to do experiments to find the best schedule for these values, similar to the schedule for T in deterministic annealing.
When do we stop iterating for Q learning? Is it when PI stops changing like policy iteration (doesn't seem right to me) or is it when Q stops changing? Since the real goal is always to learn an optimal policy, stopping when the policy pi stops changing seems sensible. But you need to ask the question, how can you tell for sure that pi will never change any more?
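One possible way to vary epsilon or the softmax constant as learning progresses is an exponential decay over trials. The sketch below is only one candidate schedule to experiment with: the decay form, the start and end values, and the interpretation of the softmax constant as a temperature that divides the Q values are all assumptions.

```python
import math
import random

def decayed(initial, final, rate, trial):
    """Anneal a parameter exponentially from `initial` toward `final` as trials pass."""
    return final + (initial - final) * math.exp(-rate * trial)

def epsilon_greedy(q_values, epsilon, rng):
    """With probability epsilon pick a random action index, otherwise the greedy one."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_values))
    return max(range(len(q_values)), key=q_values.__getitem__)

def softmax_probs(q_values, temperature):
    """Action probabilities proportional to exp(Q / temperature)."""
    top = max(q_values)                       # subtract the max for numerical stability
    weights = [math.exp((q - top) / temperature) for q in q_values]
    total = sum(weights)
    return [w / total for w in weights]

rng = random.Random(0)
q = [0.2, 0.5, 0.1, 0.4]                      # hypothetical Q values for one state
for trial in (0, 100, 1000):
    eps = decayed(0.5, 0.01, 0.005, trial)
    temp = decayed(1.0, 0.05, 0.005, trial)
    probs = [round(p, 3) for p in softmax_probs(q, temp)]
    print(trial, round(eps, 3), round(temp, 3), probs, epsilon_greedy(q, eps, rng))
```

Early trials explore widely; as epsilon and the temperature shrink, both rules concentrate on the action with the highest Q value.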
|
| Meir Schwarz
|
125
|
 |
|
12-07-2008 09:16 PM ET (US)
|
|
Another question. When do we stop iterating for Q learning? Is it when PI stops changing like policy iteration (doesn't seem right to me) or is it when Q stops changing?
|
| Meir Schwarz
|
124
|
 |
|
12-07-2008 08:50 PM ET (US)
|
|
Are we allowed to change epsilon or the constant in softmax as learning progresses?
|
| Charles Elkan
|
123
|
 |
|
12-07-2008 06:46 PM ET (US)
|
|
How is it possible for our algorithm to work to find the optimal policy, but not be able to find the alternative optimal policies (it basically finds the opposite of those policies when values in range are entered)? It does sound like you have one (or more!) bugs. For debugging suggestions see /m122.
|
| Charles Elkan
|
122
|
 |
|
12-07-2008 06:45 PM ET (US)
|
|
Our Policy Iteration algorithm produces the correct policy but the values seem to be off by about .05. Is this acceptable or does it hint that we have a slight error in our algorithm? If not, any idea where we should start debugging?
Many different value functions can lead to the same policy, so this is a hint that there is an error somewhere. The error may be in your code, but it may also be in your understanding of the gridworld domain. For example, the exact definition of when an action has an unintended effect (e.g. moving left on an "up" action) is not clear.
For debugging, it may be easiest to start with the goal state (upper right) and figure out whether its learned value is correct. Then do the same for states next to the goal state, then one away, and so on.
|
| Charles Elkan
|
121
|
 |
|
12-07-2008 06:40 PM ET (US)
|
|
Will there be a review session for the final? In the section tomorrow (Monday) you can ask any and all questions relevant to the final. Or you can ask questions here.
If anyone feels an additional in-person review session would be useful, please email me personally.
Remember, a review session is only useful if you come prepared with specific questions.
|