## Evaluation datasets

We collected 20 mathematical evaluation datasets that have been widely used at dozens of top artificial intelligence conferences, such as ACL, AAAI, and ICLR, from 2010 to the present. Together, these datasets cover a range of grade levels, question types, text forms, and difficulty levels, which supports more comprehensive and fine-grained evaluation of the mathematical ability of the LLMs being evaluated.

##### Ability

##### GAOKAO-2024

The 2024 Chinese college entrance examination (Gaokao) math test includes 115 multiple-choice questions and 114 fill-in-the-blank and application questions.

##### GAOKAO-2023

To keep the evaluation fair and to ensure that the models have not been exposed to the latest data during training, we update our Chinese National College Entrance Examination (Gaokao) dataset and leaderboard annually, so that the questions are the most recent available and have not been seen by the models before. The 2023 Gaokao mathematics set we collected contains 110 multiple-choice questions, 47 fill-in-the-blank questions, and two essay questions.

##### GAOKAO(Math)

GAOKAO(Math) is a collection of Chinese high school mathematics questions from 2010 to 2022. It consists of six datasets covering multiple-choice, fill-in-the-blank, and problem-solving questions from both National Exam Paper 1 and National Exam Paper 2. Each question is accompanied by a detailed solution process, which is useful for chain-of-thought (CoT) training.

##### AGIEval

AGIEval includes five math datasets: 351 Chinese high school math multiple-choice questions, 118 Chinese high school math fill-in-the-blank questions, 220 English math multiple-choice questions from the SAT, 254 English math multiple-choice questions from the GRE, and 1,000 English math fill-in-the-blank questions from the AMC and AIME exams.
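As a quick arithmetic check on the AGIEval subset sizes listed above, the following minimal sketch tallies the question counts by language and by question format. The subset labels (`gaokao_choice`, `amc_aime`, and so on) are illustrative names chosen here, not official AGIEval identifiers; only the counts come from the description above.

```python
from collections import Counter

# (label, language, question format, count).
# Counts are taken from the description above; the labels are
# hypothetical and are NOT official AGIEval subset names.
AGIEVAL_MATH = [
    ("gaokao_choice", "zh", "multiple_choice", 351),
    ("gaokao_cloze", "zh", "fill_in_blank", 118),
    ("sat", "en", "multiple_choice", 220),
    ("gre", "en", "multiple_choice", 254),
    ("amc_aime", "en", "fill_in_blank", 1000),
]


def tally(rows, field):
    """Sum question counts grouped by a tuple index (1 = language, 2 = format)."""
    totals = Counter()
    for row in rows:
        totals[row[field]] += row[3]
    return dict(totals)


def total_questions(rows):
    """Total number of questions across all subsets."""
    return sum(row[3] for row in rows)


if __name__ == "__main__":
    print("total:", total_questions(AGIEVAL_MATH))
    print("by language:", tally(AGIEVAL_MATH, 1))
    print("by format:", tally(AGIEVAL_MATH, 2))
```

Under these counts the five subsets add up to 1,943 questions, most of them in English and slightly more than half in fill-in-the-blank form.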

##### TAL-SCQ5K-CN

TAL-SCQ5K-CN is a high-quality Chinese-language mathematical competition dataset created by TAL Education Group. It consists of 5K Chinese mathematical competition questions (3K for training and 2K for testing). The questions are multiple-choice and cover mathematical topics at the primary, junior high, and high school levels. Detailed solution steps are provided to facilitate CoT training, and all mathematical expressions in the questions are written in standard text-mode LaTeX.

##### Math23K

Math23K is a dataset consisting of 23,164 elementary school Chinese mathematical word problems. These problems are collected from various online educational websites. The questions in this dataset are focused on solving one-variable linear equations. Additionally, the dataset includes 2,317 test samples for evaluation purposes.

##### Ape210K

Ape210K is a dataset comprising 210,488 Chinese mathematical word problems. Each problem in the dataset involves a single unknown variable. The dataset also includes 5,000 test samples for evaluation purposes.

##### CMMLU(Math)

CMMLU(Math) consists of three Chinese multiple-choice math datasets at different levels: university math, high school math, and elementary school math. Their test sets contain 105, 230, and 164 samples, respectively.
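Because the three CMMLU(Math) test sets differ in size, an overall score depends on how the per-level accuracies are combined. The sketch below contrasts an unweighted (macro) average with a size-weighted (micro) average. It is only an illustration: the accuracy values are made up, the function names are hypothetical, and nothing here is meant to describe how the leaderboard itself aggregates scores.

```python
# Test-set sizes as stated above for CMMLU(Math).
CMMLU_MATH_SIZES = {
    "university": 105,
    "high_school": 230,
    "elementary": 164,
}


def macro_average(accuracies):
    """Unweighted mean: every level counts equally."""
    return sum(accuracies.values()) / len(accuracies)


def micro_average(accuracies, sizes):
    """Size-weighted mean: every question counts equally."""
    total = sum(sizes[name] for name in accuracies)
    weighted = sum(accuracies[name] * sizes[name] for name in accuracies)
    return weighted / total


if __name__ == "__main__":
    # Hypothetical per-level accuracies, for illustration only.
    example = {"university": 0.40, "high_school": 0.50, "elementary": 0.70}
    print(f"macro: {macro_average(example):.4f}")
    print(f"micro: {micro_average(example, CMMLU_MATH_SIZES):.4f}")
```

The two averages diverge whenever a model does better or worse on the larger subsets, so it helps to state which one a reported number uses.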