Dave: I think it makes sense to distinguish between observations
and states when the environment is only partially observable; the
observation is just the visible part of the state.
It's odd to me that none of the scenarios in the figures looks
like a Nash equilibrium, yet the authors write of both examples
that ``the algorithm consistently settled on the solution
shown''. For some reason, though, the policy selected for the
first example is only ``approximately uniform''. Why?
The second example seems to have even stranger irregularities.
Shouldn't the desired policies, votes, and resultant policy
reflect the symmetry in the state diagram? Why then, for example,
do both sources agree the agent should move left in states 2 and
3, but not that it should move right in states 7 and 8? The
authors not only fail to explain this asymmetry, they do not
even mention it.