1. Jonathan Baxter. 1997. A Bayesian/information theoretic model of learning to learn via multiple task sampling. Machine Learning 28, 1 (1997), 7–39.
2. Dan Hendrycks and Kevin Gimpel. 2016. Gaussian Error Linear Units (GELUs). arXiv: Learning (2016).
3. Geoffrey E. Hinton, Nitish Srivastava, Alex Krizhevsky, Ilya Sutskever, and Ruslan Salakhutdinov. 2012. Improving neural networks by preventing co-adaptation of feature detectors. arXiv: Neural and Evolutionary Computing (2012).
4. Jimmy Lei Ba, Jamie Ryan Kiros, and Geoffrey E. Hinton. 2016. Layer Normalization. arXiv e-prints, Article arXiv:1607.06450 (Jul 2016). arXiv:stat.ML/1607.06450