## Evaluation datasets

We collected 20 mathematical evaluation datasets that have been widely used since 2010 in work published at dozens of top artificial intelligence conferences such as ACL, AAAI, and ICLR. Together they cover a range of grade levels, question types, text forms, and difficulty levels, which supports more comprehensive and fine-grained evaluation of the mathematical ability of participating LLMs.

##### GSM8K

GSM8K is a small-scale elementary school mathematics dataset of 8.5K problems. Each problem takes 2 to 8 steps to solve using basic arithmetic operations. The dataset is split into a training set of 7.5K examples and a test set of 1K examples. Every answer in GSM8K includes the complete solution process, which is useful for CoT training.
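Because each GSM8K answer is a full solution rather than a bare number, scoring usually needs the final value pulled out of the solution text. The sketch below assumes the convention used in the public GSM8K release, where the last line of a solution has the form `#### <answer>`. The helper name `extract_final_answer` and the sample problem are made up for illustration and are not taken from the dataset.

```python
import re
from typing import Optional


def extract_final_answer(solution: str) -> Optional[str]:
    """Return the value after the closing '####' marker, or None if absent.

    Assumes the final line looks like '#### 1,234'. Thousands separators
    are removed so the result can be compared as a plain numeric string.
    """
    match = re.search(r"####\s*(.+?)\s*$", solution.strip())
    if match is None:
        return None
    return match.group(1).replace(",", "")


# Hypothetical solution text written for this example.
example = (
    "Tom has 3 boxes with 4 pens each, so he has 3 * 4 = 12 pens.\n"
    "He gives away 5, leaving 12 - 5 = 7 pens.\n"
    "#### 7"
)

print(extract_final_answer(example))
```

Comparing extracted strings is only one possible scoring rule; a real evaluation may also need to normalize units, decimals, or fractions.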

##### MATH

MATH is a dataset of 12,500 problems (7,500 training, 5,000 testing) from high school math competitions, written in text-mode LaTeX. Each problem in MATH has a full step-by-step solution, which facilitates CoT training.

##### GAOKAO-2024

The 2024 Chinese college entrance examination (Gaokao) math test includes 115 multiple-choice questions and 114 fill-in-the-blank and application questions.

##### MMLU(Math)

MMLU(Math) comprises multiple-choice questions across four subsets that differ in difficulty: abstract algebra, college mathematics, high school mathematics, and elementary mathematics. These subsets contain 100, 148, 270, and 388 test samples, respectively.

##### GAOKAO-2023

To keep the evaluation fair and to ensure that large models have not been exposed to the latest data during training, we update our Chinese National College Entrance Examination (Gaokao) dataset leaderboard annually, so the Gaokao questions always come from the most recent exam and should not have been seen before. The 2023 Gaokao mathematics set we collected contains 110 multiple-choice questions, 47 fill-in-the-blank questions, and two essay questions.

##### GAOKAO(Math)

GAOKAO(Math) is a collection of Chinese high school mathematics questions from the years 2010 to 2022. It consists of six datasets, including multiple-choice questions, fill-in-the-blank questions, and problem-solving questions from both National Exam Paper 1 and National Exam Paper 2. Each question in GAOKAO(Math) is accompanied by a detailed solution process, which is beneficial for CoT training.

##### AGIEval

AGIEval includes five math datasets, which consist of 351 Chinese high school math multiple-choice questions, 118 Chinese high school math fill-in-the-blank questions, 220 English math multiple-choice questions from the SAT, 254 English math multiple-choice questions from the GRE, and 1000 English math fill-in-the-blank questions from the AMC and AIME exams.

##### MathQA

MathQA consists of roughly 37,200 multiple-choice math questions, each with exactly one correct answer among 5 options. The questions fall into the following categories: 3,316 geometry questions, 9,830 physics math questions, 663 probability questions, 4,377 gain-loss questions, 17,796 general math questions, and 1,277 other questions.
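The category counts listed above can be tallied to see how the questions are distributed. The following is a small sketch; the dictionary keys are shorthand labels chosen for this example. The counts sum to 37,259, so the headline figure of 37,200 is best read as a rounded number.

```python
# Category counts as listed in the MathQA description above.
category_counts = {
    "geometry": 3316,
    "physics": 9830,
    "probability": 663,
    "gain-loss": 4377,
    "general": 17796,
    "other": 1277,
}

total = sum(category_counts.values())

# Share of the whole dataset contributed by each category.
shares = {name: count / total for name, count in category_counts.items()}

# Print categories from largest to smallest share.
for name, share in sorted(shares.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name:12s} {category_counts[name]:6d} {share:6.1%}")
print("total", total)
```

Because general math questions make up nearly half of the set, an overall accuracy on MathQA is dominated by that category; reporting per-category accuracy alongside it gives a finer-grained picture.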

##### Arith3K

Arith3K is a high-quality arithmetic evaluation dataset created by TAL Education Group. It contains 3,000 generalized arithmetic problems organized into 3 major categories and 15 subcategories of operators and forms, covering almost all common operators and calculation forms used in scientific computing. Unlike previous arithmetic evaluation datasets, it significantly widens the difficulty span, from relatively simple arithmetic on 1-5 digit integers to complex nested operator calculations, so that it can meet arithmetic evaluation needs at different stages of LLM development.

##### TAL-SCQ5K-CN

TAL-SCQ5K-CN is a high-quality Chinese-language mathematical competition dataset created by TAL Education Group, consisting of 5K Chinese mathematical competition questions (3K training and 2K testing). The questions are multiple-choice and cover mathematical topics at the primary, junior high, and high school levels. In addition, detailed solution steps are provided to facilitate CoT training, and all mathematical expressions in the questions are presented in standard text-mode LaTeX.

##### TAL-SCQ5K-EN

TAL-SCQ5K-EN is a high-quality English-language mathematical competition dataset created by TAL Education Group, consisting of 5K English mathematical competition questions (3K training and 2K testing). The questions are multiple-choice and cover mathematical topics at the primary school level. In addition, detailed solution steps are provided to facilitate CoT training, and all mathematical expressions in the questions are presented in standard text-mode LaTeX.

##### Math23K

Math23K is a dataset consisting of 23,164 elementary school Chinese mathematical word problems. These problems are collected from various online educational websites. The questions in this dataset are focused on solving one-variable linear equations. Additionally, the dataset includes 2,317 test samples for evaluation purposes.

##### Ape210K

Ape210K is a dataset comprising 210,488 Chinese mathematical word problems. Each problem in the dataset involves a single unknown variable. The dataset also includes 5,000 test samples for evaluation purposes.

##### ASDiv-A

ASDiv-A is a dataset of 1218 arithmetic math word problems (MWPs) that is diverse in lexicon usage and wide in its coverage of problem types. Each problem is assigned one of six difficulty levels, from 1 to 6. The dataset contains 122 test samples for evaluation purposes.

##### Big-Bench-Hard(Math)

Big-Bench-Hard (Math) is a set of math problems involving multiple steps and basic arithmetic operations.

##### CMMLU(Math)

CMMLU(Math) consists of three Chinese multiple-choice math datasets at different levels: university math, high school math, and elementary school math. These datasets contain 105, 230, and 164 test samples, respectively.

##### Dolphin1878

Dolphin1878 is a dataset comprising 1878 English mathematical problems collected from both algebra.com and Yahoo websites. Each problem in the dataset contains 1-4 unknown variables. The dataset also includes 187 test samples for evaluation purposes.

##### math401-llm

math401-llm includes 401 mathematical computation problems, covering basic arithmetic operations, exponentiation, irrational number operations, trigonometric functions, and logarithmic operations. It constructs arithmetic expressions to test large language models: Euler's equation ($e^{i\pi} + 1 = 0$) forms group 0, and groups 1 to 16 contain 25 problems each. Unless otherwise noted, the numbers used are positive integers.
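The group structure described above accounts for the dataset's name: one item in group 0 plus 16 groups of 25 problems gives 401. A short sketch of that tally, together with a numerical check of the group 0 identity using Python's standard `cmath` module (the variable names are illustrative only):

```python
import cmath

# Group 0 holds the single Euler-equation item; groups 1..16 hold 25 problems each.
group_sizes = {0: 1}
group_sizes.update({group: 25 for group in range(1, 17)})

total_problems = sum(group_sizes.values())

# e^{i*pi} + 1 is exactly 0 mathematically; in floating point it is only
# approximately 0, so compare its magnitude against a small tolerance.
euler_residual = abs(cmath.exp(1j * cmath.pi) + 1)

print(total_problems)
print(euler_residual < 1e-12)
```

The tolerance check is a reminder that answers to computation problems of this kind usually need to be compared numerically within some error bound rather than by exact string match.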

##### MAWPS

MAWPS is a dataset consisting of 3320 English mathematical word problems. These problems cover basic arithmetic operations and have been collected from various online educational websites. The dataset also includes 238 test samples for evaluation purposes.

##### SVAMP

SVAMP consists of 1000 mathematical word problems, each containing no more than two mathematical expressions and one unknown variable. The problems target grade levels up to 4 and were created by applying simple variations to word problems from existing datasets such as ASDiv-A.