1. Barr, D. J., Levy, R., Scheepers, C., & Tily, H. J. (2013). Random effects structure for confirmatory hypothesis testing: Keep it maximal. Journal of Memory and Language, 68(3), 255–278. Available from: https://doi.org/10.1016/j.jml.2012.11.001
2. Berg-Kirkpatrick, T., Burkett, D., & Klein, D. (2012). An empirical investigation of statistical significance in NLP. In Proceedings of the Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning (EMNLP-CoNLL). Available from: https://aclanthology.org/D12-1091
3. Canty, A. J., Davison, A. C., Hinkley, D. V., & Ventura, V. (2006). Bootstrap diagnostics and remedies. The Canadian Journal of Statistics, 34(1), 5–27. Available from: https://doi.org/10.1002/cjs.5550340103
4. Card, D., Henderson, P., Khandelwal, U., Jia, R., Mahowald, K., & Jurafsky, D. (2020). With little power comes great responsibility. In Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP). Virtual. Available from: https://doi.org/10.18653/v1/2020.emnlp-main.745
5. Clark, J., Dyer, C., Lavie, A., & Smith, N. (2011). Better hypothesis testing for statistical machine translation: Controlling for optimizer instability. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics (ACL). Available from: https://aclanthology.org/P11-2031