## Evaluation datasets

A total of 20 mathematical evaluation datasets have been collected, all of them widely used since 2010 at dozens of top artificial intelligence conferences such as ACL, AAAI, and ICLR. The collected datasets cover, to varying degrees, different grades, question types, text forms, and difficulty levels of mathematical problems. This breadth makes it possible to give the evaluated LLMs more comprehensive and fine-grained results on their mathematical ability.

##### GSM8K

GSM8K is a small-scale dataset of about 8.5K elementary school math problems. The problems involve basic arithmetic operations, and each takes 2 to 8 steps to solve. The dataset is split into a training set of 7.5K examples and a test set of 1K examples. Every answer in GSM8K includes the complete solution process, which is useful for chain-of-thought (CoT) training.

##### MATH

MATH consists of 12,500 problems (7,500 for training, 5,000 for testing) from high school math competitions, written in text-mode LaTeX. Each problem in MATH has a full step-by-step solution, which facilitates CoT training.

##### MMLU(Math)

MMLU(Math) comprises multiple-choice questions in four subject areas of varying difficulty: abstract algebra, college mathematics, high school mathematics, and elementary mathematics. The numbers of test samples are 100, 148, 270, and 388, respectively.
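
Because the four subsets differ in size, an overall MMLU(Math) score depends on how the per-subject accuracies are combined. The sketch below contrasts a macro average (every subject counts equally) with a micro average (every question counts equally), using the test-set sizes listed above. The function names, dictionary keys, and example accuracies are hypothetical and purely illustrative; the text does not state which aggregation, if any, is used.

```python
# Test-set sizes as listed in the text above.
MMLU_MATH_TEST_SIZES = {
    "abstract_algebra": 100,
    "college_mathematics": 148,
    "high_school_mathematics": 270,
    "elementary_mathematics": 388,
}


def macro_average(acc_by_subject):
    """Unweighted mean of the per-subject accuracies."""
    return sum(acc_by_subject.values()) / len(acc_by_subject)


def micro_average(acc_by_subject, sizes=MMLU_MATH_TEST_SIZES):
    """Mean of the per-subject accuracies, weighted by number of test questions."""
    total = sum(sizes[subject] for subject in acc_by_subject)
    weighted = sum(acc * sizes[subject] for subject, acc in acc_by_subject.items())
    return weighted / total


# Hypothetical accuracies for illustration only (not results from any model).
example_acc = {
    "abstract_algebra": 0.30,
    "college_mathematics": 0.40,
    "high_school_mathematics": 0.50,
    "elementary_mathematics": 0.70,
}

print(f"macro average: {macro_average(example_acc):.3f}")
print(f"micro average: {micro_average(example_acc):.3f}")
```

With these made-up numbers the micro average comes out higher than the macro average, because the largest subset (elementary mathematics) is also the one with the highest accuracy.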

##### MathQA

MathQA consists of approximately 37,200 multiple-choice math questions, each with exactly one correct answer among 5 options. The questions fall into six categories: 3,316 geometry questions, 9,830 physics math questions, 663 probability questions, 4,377 gain-loss questions, 17,796 general math questions, and 1,277 other questions.
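
The category counts above are uneven, which matters when reading a single aggregate accuracy on MathQA. As a minimal sketch, the snippet below turns the listed counts into shares of the whole; the dictionary keys and the helper name `category_shares` are illustrative choices, not identifiers from the dataset. Note that the six counts sum to 37,259, slightly more than the rounded total quoted in the text.

```python
# Category counts as listed in the text above (keys are illustrative labels).
MATHQA_CATEGORY_COUNTS = {
    "geometry": 3316,
    "physics": 9830,
    "probability": 663,
    "gain-loss": 4377,
    "general": 17796,
    "other": 1277,
}


def category_shares(counts):
    """Return each category's fraction of the total number of questions."""
    total = sum(counts.values())
    return {name: n / total for name, n in counts.items()}


shares = category_shares(MATHQA_CATEGORY_COUNTS)
print(f"total questions: {sum(MATHQA_CATEGORY_COUNTS.values())}")
# Print categories from largest to smallest share.
for name, share in sorted(shares.items(), key=lambda kv: -kv[1]):
    print(f"{name:<12}{share:6.1%}")
```

General math questions account for just under half of the dataset, while probability questions make up under 2%, so an overall score says little about the smaller categories.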

##### TAL-SCQ5K-EN

TAL-SCQ5K-EN is a high-quality English-language mathematical competition dataset created by TAL Education Group, consisting of 5K competition questions (3K for training and 2K for testing). The questions are multiple-choice and cover mathematical topics at the primary school level. Detailed solution steps are provided to facilitate CoT training, and all mathematical expressions in the questions are written in standard text-mode LaTeX.

##### ASDiv-A

ASDiv-A is a math word problem (MWP) dataset of 1218 problems involving arithmetic operations. It is diverse in lexicon usage and wide in problem-type coverage. The problems are divided into six difficulty levels, from 1 to 6. The dataset includes 122 test samples for evaluation.

##### Dolphin1878

Dolphin1878 comprises 1878 English mathematical problems collected from the algebra.com and Yahoo websites. Each problem contains 1-4 unknown variables. The dataset includes 187 test samples for evaluation.

##### MAWPS

MAWPS consists of 3320 English mathematical word problems that cover basic arithmetic operations and were collected from various online educational websites. The dataset includes 238 test samples for evaluation.

##### SVAMP

SVAMP consists of 1000 mathematical word problems, each containing no more than two mathematical expressions and one unknown variable. The problems are at grade level 4 or below and were created by applying simple variations to word problems from existing datasets such as ASDiv-A.