1. Bengio Y, Schwenk H, Senécal JS, Morin F, Gauvain JL (2003) A neural probabilistic language model. J Mach Learn Res 3(6):1137–1155
2. Bengio Y, Simard P, Frasconi P (1994) Learning long-term dependencies with gradient descent is difficult. IEEE Trans Neural Netw 5(2):157–166
3. Brown PF, deSouza PV, Mercer RL, Della Pietra VJ, Lai JC (1992) Class-based n-gram models of natural language. Comput Linguist 18(4):467–479
4. Chelba C, Mikolov T, Schuster M, Ge Q, Brants T, Koehn P, Robinson T (2013) One billion word benchmark for measuring progress in statistical language modeling. arXiv preprint arXiv:1312.3005
5. Collobert R, Weston J, Bottou L, Karlen M, Kavukcuoglu K, Kuksa P (2011) Natural language processing (almost) from scratch. J Mach Learn Res 12:2493–2537