Dave:
I wholeheartedly agree that the exposition in this paper is far
from an ideal elucidation of their ideas. Perhaps it's because I
lacked some of the necessary background, but I really struggled
to understand it in detail.
I don't think they've assumed the mentor and the observer have
the same goal(s), however. As you point out, they stress that the
mentor and the observer don't need to have the same reward
function for their technique to work. The example illustrated in
figures 9 and 10 is their attempt to demonstrate this.
Greg:
This is sort of a fine point, but in figures 3 and 4 I don't
think the mentor is giving misleading information; rather, the
learner's prior beliefs about the mentor's transition
probabilities are doing the damage. I believe the mentor itself
is actually following an optimal policy.
This kind of misunderstanding is easy to fall into, especially
since their explanation of how the priors are computed is (for
me) less than adequate.