#News · 2025-01-06
OpenAI researcher Jason Wei has just published a blog post exploring an underrated but occasionally make-or-break skill in current AI research: finding datasets that actually exercise the new method being tested. The skill barely existed a decade ago, but today it can make the difference between a study succeeding and failing.

A common example is "On which data sets does the Chain of Thought (CoT) improve performance?" . One recent paper even suggests that CoT is helpful mainly for math and logic tasks. Wei sees this view as a sign of a lack of imagination and diverse assessments. If we simply tested the CoT model on 100 random user chat prompts, we might not see a significant difference, but that's simply because these prompts would have been solved without CoT. In fact, there are certain subsets of data on which CoT can provide significant improvements - such as math and programming tasks, as well as any task that validates asymmetries.
In other words, before concluding that "method X doesn't work," you need to make sure that the dataset used for testing actually exercises the method.
Wei's post highlights how, as model capabilities continue to grow, the choice of evaluation datasets in AI research is becoming both more nuanced and more critical. His original post, quoted in full below, expands on the point.
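As a rough illustration of the comparison described above (a sketch, not any particular implementation), the snippet below measures the gap between running the same prompts with and without a chain-of-thought instruction. The `generate` call and the `check` verifier are hypothetical placeholders for whatever model API and task-specific grader you use; the point is that the measured gap depends entirely on which prompts go in.

```python
from typing import Callable

def pass_rate(prompts: list[str],
              generate: Callable[[str], str],
              check: Callable[[str, str], bool],
              use_cot: bool) -> float:
    """Fraction of prompts whose output passes a task-specific check."""
    passed = 0
    for prompt in prompts:
        if use_cot:
            # Hypothetical CoT trigger; swap in whatever prompting scheme you test.
            prompt = prompt + "\n\nThink step by step, then give the final answer."
        output = generate(prompt)
        if check(prompt, output):
            passed += 1
    return passed / len(prompts)

def cot_gap(prompts: list[str],
            generate: Callable[[str], str],
            check: Callable[[str, str], bool]) -> float:
    """Measured benefit of CoT on this particular set of prompts.
    Near zero if the prompts were already solvable without CoT."""
    return (pass_rate(prompts, generate, check, use_cot=True)
            - pass_rate(prompts, generate, check, use_cot=False))
```

For a constrained-poem task, `check` would only verify the constraints, which is cheap even though producing the poem is hard; that cheap check is the verification asymmetry Wei refers to.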
Jason Wei, AI researcher @OpenAI
An underrated but occasionally make-or-break skill in AI research (that didn’t really exist ten years ago) is the ability to find a dataset that actually exercises a new method you are working on. Back in the day when the bottleneck in AI was learning, many methods were dataset-agnostic; for example, a better optimizer would be expected to improve on both ImageNet and CIFAR-10. Nowadays language models are so multi-task that the answer to whether something works is almost always “it depends on the dataset”.
A common example of this is the question, “on what datasets does chain of thought improve performance?” A recent paper even argued (will link below) that CoT mainly helps on math/logic, and I think that is both a failure of imagination and a lack of diverse evals. Naively you might try CoT models on 100 random user chat prompts and not see much difference, but this is because the prompts were already solvable without CoT. In fact there is a small and very important slice of data where CoT makes a big difference—the obvious examples are math and coding, but include almost any task with asymmetry of verification. For example, generating a poem that fits a list of constraints is hard on the first try but much easier if you can draft and revise using CoT.
A common example is the question: "On which data sets does the Chain of Thought (CoT) improve performance?" One recent paper even argues (link attached) that CoT primarily helps with math/logic, which I think is both a failure of imagination and the result of a lack of diverse assessment. You might simply try the CoT model on 100 random user chat prompts and not see much of a difference, but that's because those prompts can already be solved without CoT. In fact, CoT can make a big difference on a small set of very important data - the obvious examples are math and coding, but also almost any task with validation asymmetries. Generating a poem that fits a set of constraints, for example, is difficult on the first try, but much easier if you can draft and revise it using CoT.
As another made-up example, let’s say you want to know if browsing improves performance on geology exams. Maybe using browsing on some random geology dataset didn’t improve performance. The important thing to do here would be to see if the without-browsing model was actually suffering due to lack of world knowledge—if it wasn’t, then this was the wrong dataset to try browsing on.
In other words you should hesitate to draw a conclusion like “X method doesn’t work” without ensuring that the dataset used for testing actually exercises that method. The inertia from five years ago is to take existing benchmarks and try to solve them, but nowadays there is a lot more flexibility and sometimes it even makes sense to create a custom dataset to showcase the initial usefulness of an idea. Obviously the danger with doing this is that a contrived dataset may not represent a substantial portion of user queries. But if the method is in principle general I think this is a good way to start and something people should do more often.
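To make the browsing example concrete, here is a hedged sketch of the diagnostic step Wei points to: before concluding that browsing does not help, check why the no-browsing baseline fails on the candidate dataset. All of the helper functions here (`answer_without_browsing`, `grade`, `classify_error`) are hypothetical placeholders, and the error labels are illustrative.

```python
from collections import Counter
from typing import Callable, Iterable, Tuple

def diagnose_failures(examples: Iterable[Tuple[str, str]],
                      answer_without_browsing: Callable[[str], str],
                      grade: Callable[[str, str], bool],
                      classify_error: Callable[[str, str], str]) -> Counter:
    """Count why the no-browsing model fails on (question, reference) pairs.
    classify_error labels each failure, e.g. 'missing_world_knowledge'
    vs. 'reasoning_error'."""
    causes = Counter()
    for question, reference in examples:
        answer = answer_without_browsing(question)
        if not grade(answer, reference):
            causes[classify_error(question, answer)] += 1
    return causes

# If causes["missing_world_knowledge"] is near zero, the dataset never
# exercised browsing, and a null result says little about the method.
```

The same logic applies to a custom dataset built to showcase a new idea: the construction is only meaningful if the baseline demonstrably fails for the reason the method is supposed to fix.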