In response to a question asked during my presentation, the experiments with the foraging robots used Q-learning. The measure of correctness shown in the graph was the fraction of condition-behavior pairs that were correct (the optimal policy was calculated by hand). Since there are only a few actions (4), the policy learned without shaping was no better than random.
My slides are posted at
http://www.cs.ucsd.edu/~kbranson/rewardshaping.pptFinally, the top of column two on page five clarifies the confusion I had with the sufficiency proof. If a policy maximizes Q(M'), then it maximizes Q(M) - phi(s), and since phi(s) does not depend on the action to take in state s, it also maximizes Q(M). Therefore, an optimal policy in M' is an optimal policy in M.