#News · 2025-01-06
Just by changing the variable names in math problems, large models collectively get dumber??
Recent research from Stanford University shows that on their new Putnam-AXIOM test set, models' accuracy plummets simply when the variable names and value ranges of the original problems are changed.
In other words, the mathematical "reasoning" of large models may not reflect a real grasp of the underlying logic; quite possibly, they are just retrieving problems they have memorized...

Even the best-performing o1-preview dropped from 50% to 33.96%, and GPT-4o, Claude, DeepSeek, Qwen, and the other models fared no better across the board.

Keep in mind that the robustness of a model's reasoning is an important indicator of whether it has truly mastered how to solve a problem.

One netizen quipped: so the "o" in o1 doesn't stand for overfitting, does it? (doge)

Another enthusiastic netizen offered an explanation: in his view, the model's search space grows exponentially with depth, so the longer the search runs, the harder it becomes.


An LLM's ability to reason about complex mathematical problems has become a key challenge in model development, yet existing evaluation benchmarks such as MMLU, MMMU, GSM8K, and MATH suffer from several problems.
On one hand, data contamination can inflate a model's evaluation scores, because the model may have already seen the benchmark problems during training.
On the other hand, state-of-the-art models have reached or exceeded human-level performance on many existing benchmarks, which reduces the benchmarks' value as a measure of progress.
In response, the Stanford team came up with the Putnam-AXIOM benchmark, which is designed to evaluate a model's ability to solve complex mathematical problems.

The benchmark's original dataset covers 236 problems from the William Lowell Putnam Mathematical Competition, spanning 1985 to 2023.

The problems cover 11 different areas of mathematics, and the team filtered them to keep only those with \boxed{} final answers that can be graded automatically.
They also follow the MATH dataset's evaluation methodology and design an equivalence function to handle inconsistent answer strings and to normalize mathematically equivalent expressions.
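The team's actual equivalence function lives in their code release; as a rough sketch of the idea, an automatic grader might first try a normalized string match and then fall back to symbolic comparison. The helper names below are illustrative, and a production version would need a proper LaTeX parser:

```python
import re
import sympy
from sympy.parsing.sympy_parser import parse_expr

def extract_boxed(text: str):
    """Return the contents of the last \\boxed{...} in a model's output, if any."""
    matches = re.findall(r"\\boxed\{([^{}]*)\}", text)  # nested braces not handled here
    return matches[-1].strip() if matches else None

def answers_equivalent(candidate: str, reference: str) -> bool:
    """Check whether two answer strings denote the same mathematical value."""
    normalize = lambda s: s.replace(" ", "").strip("$")
    # Cheap path: the strings already match after trivial normalization.
    if normalize(candidate) == normalize(reference):
        return True
    # Symbolic path: parse both sides and see whether their difference simplifies to 0.
    try:
        diff = sympy.simplify(parse_expr(normalize(candidate)) - parse_expr(normalize(reference)))
        return diff == 0
    except Exception:
        return False

pred = extract_boxed(r"Therefore the minimum value is \boxed{1/2}.")
print(answers_equivalent(pred, "0.5"))  # True: 1/2 and 0.5 are the same number
```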
In addition, to prevent evaluation bias from the original Putnam problems having appeared during training, the team introduced functional variations to build a variation dataset.
Variations come in two kinds: variable changes (renaming the quantities only) and constant changes (modifying numerical values). They can generate an unlimited number of new problems of the same difficulty, and these problems have no ready-made answers anywhere on the Internet.

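To make the two kinds of change concrete, here is a hypothetical illustration in Python; the toy problem, the renaming map, and the helper functions are invented for exposition and are not taken from the Putnam-AXIOM dataset or the team's generation pipeline:

```python
import re

# Toy problem "template": the answer is kept in terms of the constant that
# appears in the statement, so a constant change can update both consistently.
ORIGINAL = {
    "statement": "Let x and y be positive reals with x + y = 2023. "
                 "Find the maximum possible value of x*y.",
    "answer": "2023**2 / 4",
}

def variable_change(problem: dict, rename: dict) -> dict:
    """Variable change: rename symbols only; the mathematics is untouched."""
    out = {}
    for key, text in problem.items():
        for old, new in rename.items():
            text = re.sub(rf"\b{re.escape(old)}\b", new, text)
        out[key] = text
    return out

def constant_change(problem: dict, replace: dict) -> dict:
    """Constant change: swap a numerical value in both statement and answer."""
    out = dict(problem)
    for old, new in replace.items():
        out = {k: v.replace(str(old), str(new)) for k, v in out.items()}
    return out

variant = constant_change(variable_change(ORIGINAL, {"x": "p", "y": "q"}), {2023: 2021})
print(variant["statement"])  # same problem, new names and a new constant
print(variant["answer"])     # 2021**2 / 4
```

Note that the answer survives the substitution here only because it is written in terms of the changed constant; a real generator has to re-solve the problem whenever a constant changes.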
In the experiments, the researchers collated the 236 problems from the 1985-2023 competitions into a standardized format and used the LM Harness evaluation framework to test SOTA LLMs, including several open-source models.
The evaluation covered the 236 original problems and 52 variation problems, and the models tested included OpenAI o1-preview, GPT-4o, Claude 3.5 Sonnet, and many others.
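The article does not show the exact evaluation commands, but with EleutherAI's LM Evaluation Harness (v0.4+) a run along these lines is typical. The task names below are placeholders, since the real Putnam-AXIOM task definitions live in the team's repository, and the model name is only an example:

```python
# pip install lm-eval   (EleutherAI's LM Evaluation Harness, v0.4+)
from lm_eval import simple_evaluate

results = simple_evaluate(
    model="hf",                                        # Hugging Face backend
    model_args="pretrained=Qwen/Qwen2.5-1.5B-Instruct",
    tasks=["putnam_axiom_original", "putnam_axiom_variations"],  # placeholder task names
    batch_size=4,
)

# Per-task metrics (e.g. exact-match accuracy) end up under results["results"].
for task, metrics in results["results"].items():
    print(task, metrics)
```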
The experimental results were somewhat surprising, and the models' performance was far from encouraging.
Let's first look at how the models fare on the original dataset.
Most models scored below 10% accuracy, and NuminaMath, a winner of the AI Mathematical Olympiad (AIMO) Progress Prize, managed only 4.66%, which shows just how difficult Putnam-AXIOM is.

On the variation dataset, the models' accuracy dropped significantly.
For example, o1-preview, the best performer on the original dataset at 50% accuracy, fell to 33.96% on the variation problems.
In other words, o1-preview's score on the original problems is likely inflated, resting more on memorization than on genuine reasoning ability.
Second-placed Claude went from 26.40% on the original dataset to 18.86% on the variation dataset, and most of the other models' scores dropped as well.
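In relative terms the degradation is sizable; a quick back-of-the-envelope calculation on the numbers quoted above:

```python
# Relative accuracy drop from the original set to the variation set.
def relative_drop(original: float, variation: float) -> float:
    return (original - variation) / original * 100

print(f"o1-preview: {relative_drop(50.00, 33.96):.1f}% relative drop")  # ~32.1%
print(f"Claude:     {relative_drop(26.40, 18.86):.1f}% relative drop")  # ~28.6%
```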

The team also took a closer look at the answers produced by OpenAI o1-preview and GPT-4o.
The analysis shows that their errors are serious, with obvious flaws in logical reasoning and mathematical rigor.
Let's look at a few examples.
In one problem, for instance, o1-preview fails to give a sufficient proof: it asserts that the maximum possible value of m is n while only arguing that m is bounded above by 2n, never explaining why values of m between n and 2n are infeasible.

GPT-4o, for its part, shows logical jumps and incoherent reasoning. In one problem, it leaps straight to the claim that the smallest geometric shape is a rectangle, offering no justification and simply treating the claim as fact.

DeepSeek's model likewise makes leaps of logic at key steps, leading to an incorrect final result.

It seems that there is still a long way to go to improve the mathematical ability of large models.
Still, the Putnam-AXIOM benchmark proposed in the Stanford paper does help relieve the saturation of existing benchmarks.
It not only offers a genuinely challenging new way to evaluate models' mathematical reasoning, but also supports fully automated grading and provides a rich, diverse set of variation problems.
The team also notes that although generating the variation dataset is currently complex and time-consuming, optimizing the variation-generation process in the future would help accelerate research on AI reasoning.

Paper: https://openreview.net/forum?id=YXnwlZe0yf&noteId=yrsGpHd0Sf
Code: https://anonymous.4open.science/r/putnam-axiom-B57C/README.md