Reinforcement Learning in Noisy and Non-Markovian Domains
Mark Pendrith
School of Computer Science and Engineering,
The University of New South Wales,
Sydney 2052, Australia.
pendrith@cse.unsw.edu.au
Abstract
If reinforcement learning techniques are to be used for ``real world''
dynamic system control, the problems of noise and plant disturbance will have
to be addressed. This study investigates the effects of noise/disturbance on
five different RL algorithms: Watkins' 1-step Q-Learning; Barto, Sutton
and Anderson's Adaptive Heuristic Critic; Sammut and Law's modern variant
of Michie and Chambers' BOXES algorithm; and two new algorithms developed
during the course of this study. Both new algorithms, called P-Trace and
Q-Trace respectively, are conceptually related to Q-Learning. They provide
substantially faster learning than straight Q-Learning overall, and
dramatically faster learning (by up to a factor of 200) in the special case
of learning in a noisy environment for the dynamic system studied here (a
pole-and-cart simulation).
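For orientation, the standard 1-step Q-Learning update to which the new
algorithms are conceptually related is
\[
Q(s_t, a_t) \leftarrow Q(s_t, a_t)
  + \alpha \left[ r_{t+1} + \gamma \max_{a'} Q(s_{t+1}, a') - Q(s_t, a_t) \right],
\]
where $\alpha$ is the learning rate and $\gamma$ the discount factor. (Shown
for reference only; the P-Trace and Q-Trace update rules themselves are given
in the full paper.)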
As well as speeding learning, both the P-Trace and Q-Trace algorithms have
been designed to preserve the ``convergence with probability 1'' formal
properties of standard Q-Learning, i.e. to be provably ``correct'' algorithms
for Markovian domains under the same conditions for which Q-Learning is
guaranteed to be correct. We present both arguments and experimental evidence
that ``actual return'' methods may prove to be both faster and more powerful
in general than temporal difference methods. The potential performance
advantage of actual return over pure temporal difference methods may turn out
to be particularly important when learning must occur in noisy or stochastic
environments, or when the domain is not well modelled by Markovian processes.
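To make the distinction concrete, the sketch below contrasts the 1-step
temporal difference target that Q-Learning bootstraps from with an actual
return computed from the rewards observed over the remainder of an episode.
It is an illustrative Python fragment only; the names used (GAMMA, q_table,
episode) are assumptions made for the example, and it is not the P-Trace or
Q-Trace update itself.
\begin{verbatim}
# Illustrative sketch only: contrasts the 1-step temporal difference (TD)
# target that Q-Learning bootstraps from with an "actual return" target
# computed from the rewards actually observed for the rest of an episode.
# This is NOT the P-Trace/Q-Trace update; all names here are assumptions
# made for the example.

GAMMA = 0.95  # discount factor (assumed value)

def td_target(reward, next_state, q_table):
    """1-step TD target: immediate reward plus a bootstrapped estimate."""
    return reward + GAMMA * max(q_table[next_state].values())

def actual_return(rewards):
    """Actual return: discounted sum of the rewards that really followed."""
    g = 0.0
    for r in reversed(rewards):
        g = r + GAMMA * g
    return g

# A toy episode recorded as (state, action, reward, next_state) steps,
# with a single delayed reward at the end.
episode = [("s0", "a0", 0.0, "s1"),
           ("s1", "a0", 0.0, "s2"),
           ("s2", "a1", 1.0, "end")]
q_table = {s: {"a0": 0.0, "a1": 0.0} for s in ("s0", "s1", "s2", "end")}

_, _, r0, s1 = episode[0]
rewards = [step[2] for step in episode]
print("1-step TD target for the first step:", td_target(r0, s1, q_table))  # 0.0
print("actual return for the first step:   ", actual_return(rewards))      # 0.9025
\end{verbatim}
Because the actual return sums the rewards that really followed, credit for a
delayed reward reaches earlier state-action pairs in a single update, whereas
the bootstrapped target must propagate it back one step at a time; the
trade-off is a higher-variance target.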