Evaluating language models for mathematics through interactions

Authors:

Collins Katherine M.1, Jiang Albert Q.1, Frieder Simon2, Wong Lionel3, Zilka Miri1, Bhatt Umang1,4,5, Lukasiewicz Thomas2,6, Wu Yuhuai7, Tenenbaum Joshua B.3, Hart William1, Gowers Timothy1,8, Li Wenda1, Weller Adrian1,4, Jamnik Mateja1

Affiliations:

1. University of Cambridge, Cambridge CB2 1TN, United Kingdom

2. University of Oxford, Oxford OX1 4BH, United Kingdom

3. Massachusetts Institute of Technology, Cambridge, MA 02139, United States

4. The Alan Turing Institute, London NW1 2DB, United Kingdom

5. New York University, New York, NY 10011, United States

6. Vienna University of Technology, Vienna 1040, Austria

7. x.AI, New York, NY 10038, United States

8. Collège de France, Paris 75001, France

Abstract

There is much excitement about the opportunity to harness the power of large language models (LLMs) when building problem-solving assistants. However, the standard methodology for evaluating LLMs relies on static pairs of inputs and outputs, which is insufficient for deciding which LLMs are best to use in an interactive setting and how their suitability varies across settings. Static assessment therefore limits how we understand language model capabilities. We introduce CheckMate, an adaptable prototype platform for humans to interact with and evaluate LLMs. We conduct a study with CheckMate to evaluate three language models (InstructGPT, ChatGPT, and GPT-4) as assistants in proving undergraduate-level mathematics, with a mixed cohort of participants ranging from undergraduate students to professors of mathematics. We release the resulting interaction and rating dataset, MathConverse. By analyzing MathConverse, we derive a taxonomy of human query behaviors and find that, despite a generally positive correlation, the correctness and the perceived helpfulness of LLM generations diverge in notable instances, among other findings. Further, we gain a more granular understanding of GPT-4's mathematical problem-solving through a series of case studies contributed by experienced mathematicians. We conclude with actionable takeaways for ML practitioners and mathematicians: models that communicate uncertainty, respond well to user corrections, and can provide a concise rationale for their recommendations may constitute better assistants. Humans should inspect LLM output carefully, given these models' current shortcomings and potential for surprising fallibility.

Funders

Alan Turing Institute

EU TAILOR

Leverhulme Trust

ERC Advanced Grant ALEXANDRIA

EPSRC

Publisher

Proceedings of the National Academy of Sciences

