Dana Dahlstrom | 5 | 04-04-2002 03:03 PM ET (US)
I think Kristin's qualm with the notation in section 2.1 is right; it seems there is some inconsistency:
In paragraph 3 the sequences of actions and rewards begin with $a_1$ and $r_1$. The state $s_0$ is mentioned, but not $a_0$ or $r_0$. This leads me to suppose $r_i$ is a function of $s_{i-1}$ and $a_i$.
Equation 1 is consistent with this supposition: $r_1$ is discounted by $\gamma^{0} = 1$, and so is treated as an immediate reward. But in Equation 2 $r_1$ is discounted by $\gamma$, and suddenly there is an $a_0$. I think this can be remedied by changing $\gamma^t$ to $\gamma^{t-1}$ and $a_0$ to $a_1$ in Equation 2.
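To make the mismatch concrete, here is a minimal sketch comparing the two discounting conventions on a made-up reward sequence. The function name, the reward values, and $\gamma = 0.5$ are my own assumptions for illustration; only the exponents come from the discussion above.

```python
def discounted_return(rewards, gamma, first_exponent):
    """Sum gamma**(first_exponent + k) * rewards[k], where rewards[0] stands for r_1."""
    return sum(gamma ** (first_exponent + k) * r for k, r in enumerate(rewards))


rewards = [1.0, 2.0, 3.0]  # hypothetical r_1, r_2, r_3
gamma = 0.5                # hypothetical discount factor

# Equation 1 convention: r_t is weighted by gamma^(t-1), so r_1 is undiscounted.
eq1_style = discounted_return(rewards, gamma, first_exponent=0)

# Equation 2 as written: r_t is weighted by gamma^t, so r_1 is already discounted.
eq2_style = discounted_return(rewards, gamma, first_exponent=1)

print(eq1_style, eq2_style)
```

Under these assumptions the second sum is exactly $\gamma$ times the first, so the two equations cannot both describe the same quantity unless the exponent in one of them is shifted.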
Equation 3 is consistent with the others if the pair $(s,a)$ corresponds to pairs $(s_t,a_{t+1})$ and $r$ corresponds to $r_{t+1}$ from the sequences. That is, action $a_{t+1}$ is performed in state $s_t$, and the reward $r_{t+1}$ results.
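The indexing I have in mind can be sketched as a short rollout. The `step` function and its toy dynamics are purely hypothetical; the point is only the bookkeeping, where index 0 exists for states but not for actions or rewards.

```python
def step(state, action):
    """Hypothetical toy dynamics: the reward depends on the (state, action) pair."""
    reward = float(state + action)
    next_state = (state + action) % 3
    return reward, next_state


states = [0]      # states[t] is s_t, starting from s_0
actions = [None]  # actions[0] is a placeholder: there is no a_0
rewards = [None]  # rewards[0] is a placeholder: there is no r_0

for t in range(3):
    a_next = 1  # a_{t+1}, chosen in state s_t (fixed action, for illustration only)
    r_next, s_next = step(states[t], a_next)  # r_{t+1} and s_{t+1}
    actions.append(a_next)
    rewards.append(r_next)
    states.append(s_next)

# Each tuple is (s_t, a_{t+1}, r_{t+1}), the (s, a, r) triple I read into Equation 3.
transitions = [(states[t], actions[t + 1], rewards[t + 1]) for t in range(3)]
print(transitions)
```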
Perhaps I overlooked something. Does this sound right?