1. Akkaya, I., Andrychowicz, M., Chociej, M., Litwin, M., McGrew, B., Petron, A., Paino, A., Plappert, M., Powell, G., Ribas, R., et al. (2019). Solving Rubik's Cube with a robot hand. arXiv preprint arXiv:1910.07113.
2. Alshiekh, M., et al. (2018). Safe reinforcement learning via shielding.
3. Amos, B., Jimenez, I., Sacks, J., Boots, B., and Kolter, J.Z. (2018). Differentiable MPC for end-to-end planning and control. In S. Bengio, H. Wallach, H. Larochelle, K. Grauman, N. Cesa-Bianchi, and R. Garnett (eds.), Advances in Neural Information Processing Systems, volume 31, 8289–8300. Curran Associates, Inc.
4. De Asis, K., et al. (2020). Fixed-horizon temporal difference methods for stable reinforcement learning.
5. Baxter, J. and Bartlett, P.L. (2001). Infinite-horizon policy-gradient estimation. Journal of Artificial Intelligence Research.