1. On-line estimation of the optimal value function: HJB estimators;Peterson;Advances in Neural Information Processing Systems 5,1992
2. Learning to predict by the method of temporal differences;Sutton;Machine Learning,1988
3. First results with Dyna, an integrated architecture for learning, planning and reacting;Sutton,1990
4. Planning by incremental dynamic programming;Sutton,1991
5. Reinforcement learning is direct adaptive optimal control;Sutton,1991