Reinforcement Learning in Noisy and Non-Markovian Domains


Mark Pendrith
School of Computer Science and Engineering, The University of New South Wales, Sydney 2052, Australia.
pendrith@cse.unsw.edu.au


Abstract

If reinforcement learning techniques are to be used for ``real world'' dynamic system control, the problems of noise and plant disturbance will have to be addressed. This study investigates the effects of noise/disturbance on five different RL algorithms: Watkins' 1-step Q-Learning; Barto, Sutton and Anderson's Adaptive Heuristic Critic; Sammut and Law's modern variant of Michie and Chambers' BOXES algorithm; and two new algorithms developed during the course of this study. These two new algorithms, called P-Trace and Q-Trace respectively, are conceptually related to Q-Learning; both provide substantially faster learning than straight Q-Learning overall, and dramatically faster learning (by up to a factor of 200) in the special case of learning in a noisy environment for the dynamic system studied here (a pole-and-cart simulation).

As well as speeding learning, both the P-Trace and Q-Trace algorithms have been designed to preserve the ``convergence with probability 1'' formal properties of standard Q-Learning, i.e. they remain provably ``correct'' algorithms for Markovian domains under the same conditions for which Q-Learning is guaranteed to be correct. We present both arguments and experimental evidence that ``actual return'' methods may prove to be both faster and more powerful in general than temporal difference methods. The potential performance improvements of actual return over pure temporal difference methods may turn out to be particularly important when learning occurs in noisy or stochastic environments, and when the domain is not well modelled by Markovian processes.
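For context, the contrast between the two families of methods can be sketched in standard notation (this notation is illustrative and not taken from the paper): a 1-step temporal difference method such as Q-Learning bootstraps from the current value estimate of the successor state, whereas an ``actual return'' method updates toward the observed discounted sum of rewards collected to the end of the trial.

\[
Q(s_t, a_t) \leftarrow Q(s_t, a_t) + \alpha \Big[ r_{t+1} + \gamma \max_{a} Q(s_{t+1}, a) - Q(s_t, a_t) \Big]
\qquad \text{(1-step TD / Q-Learning)}
\]
\[
Q(s_t, a_t) \leftarrow Q(s_t, a_t) + \alpha \Big[ \textstyle\sum_{k=0}^{T-t-1} \gamma^{k}\, r_{t+k+1} - Q(s_t, a_t) \Big]
\qquad \text{(actual return)}
\]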

