/m147,
/m148: Evaluating the policy cannot be part of the agent's learning process. It is simply a way for you the programmer to measure the success of the learning process you design.
Since the agent could never evaluate a policy with policy iteration (PI), the learning process cannot decide to stop based on this.
However, after the agent stops learning using a heuristic, then you the programmer can run PI. Your goal is to invent a heuristic that the agent can use to stop as quickly as possible with a policy that is good as possible.