## Evaluation datasets

A total of 20 mathematical evaluation datasets have been collected. All of them have been widely used since 2010 in work published at dozens of top artificial intelligence conferences such as ACL, AAAI, and ICLR. Together, the collected datasets cover, to a certain extent, different grade levels, question types, text forms, and difficulty levels of mathematical problems. This coverage makes it possible to give the participating LLMs more comprehensive and fine-grained evaluations of their mathematical ability.

##### Ability

##### Arith3K

Arith3K is a high-quality arithmetic evaluation dataset created by TAL Education Group. It contains 3,000 generalized arithmetic problems, organized into 3 major categories and 15 subcategories that differ in operators and problem forms. It covers almost all common operators and calculation forms used in scientific computing. Unlike previous arithmetic evaluation datasets, it significantly widens the difficulty span of the problems, from relatively simple arithmetic on 1- to 5-digit integers to complex nested operator calculations, so that it can meet the arithmetic evaluation needs of LLMs at different stages of their evolution.
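To make the idea of a difficulty span concrete, the following is a minimal sketch of how problems ranging from single operations on 1- to 5-digit integers to nested expressions could be generated. It is not the procedure used to build Arith3K; the names `random_operand` and `make_problem`, the `depth` parameter, and the restriction to `+`, `-`, and `*` are assumptions made purely for illustration.

```python
import operator
import random

# Illustrative operator set only; Arith3K itself covers many more operators.
OPS = {"+": operator.add, "-": operator.sub, "*": operator.mul}


def random_operand(rng, max_digits=5):
    """Draw a non-negative integer with between 1 and max_digits digits."""
    digits = rng.randint(1, max_digits)
    low = 0 if digits == 1 else 10 ** (digits - 1)
    return rng.randint(low, 10 ** digits - 1)


def make_problem(rng, depth=0):
    """Return (expression_text, exact_value).

    depth=0 yields a bare integer, depth=1 a single binary operation,
    and larger depths yield increasingly nested expressions.
    """
    if depth == 0:
        value = random_operand(rng)
        return str(value), value
    left_text, left_value = make_problem(rng, depth - 1)
    right_text, right_value = make_problem(rng, depth - 1)
    symbol = rng.choice(sorted(OPS))
    value = OPS[symbol](left_value, right_value)
    return f"({left_text} {symbol} {right_text})", value


rng = random.Random(0)                 # fixed seed for reproducibility
simple = make_problem(rng, depth=1)    # one operation on two integers
nested = make_problem(rng, depth=3)    # seven nested operations
```

Because Python integers have arbitrary precision, the reference value stored with each expression is exact, which matters when multiplying several 5-digit operands.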

##### Big-Bench-Hard (Math)

Big-Bench-Hard (Math) is a set of math problems involving multiple steps and basic arithmetic operations.

##### math401-llm

math401-llm includes 401 mathematical computation problems, covering basic arithmetic operations, exponentiation, irrational number operations, trigonometric functions, and logarithmic operations. It constructs arithmetic expressions to test large language models: group 0 consists of Euler's equation ($e^{i\pi} + 1 = 0$), and groups 1 to 16 contain 25 problems each, for a total of 1 + 16 × 25 = 401. Unless otherwise stated, the numbers used are positive integers.
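The group structure described above can be checked with a few lines of Python. The sketch below also shows one possible way to compare a model's numeric answer with a reference value; the `is_correct` helper and its tolerance values are assumptions for illustration and are not the scoring rule defined by math401-llm.

```python
import cmath
import math

# Group sizes as described above: group 0 holds one problem,
# and groups 1 through 16 hold 25 problems each.
group_sizes = {0: 1, **{group: 25 for group in range(1, 17)}}
total_problems = sum(group_sizes.values())


def is_correct(predicted, reference, rel_tol=1e-3):
    """Hypothetical scoring rule: accept answers within a relative tolerance."""
    return math.isclose(predicted, reference, rel_tol=rel_tol, abs_tol=1e-9)


# Group 0: Euler's equation evaluated numerically. The result is zero
# only up to floating-point rounding error.
euler_residual = cmath.exp(1j * cmath.pi) + 1
```

A tolerance-based comparison is a natural fit here because problems involving irrational numbers, trigonometric functions, and logarithms rarely have answers that can be written out exactly in decimal form.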