OpenAI's biggest secret cracked by Chinese researchers? Fudan and others uncover an o1 roadmap

#News ·2025-01-06

Just today, a paper from China has shocked AI scholars around the world.

Many netizens claimed that the unsolved mystery of the principle behind OpenAI's o1 and o3 models had been "uncovered" by Chinese researchers!

Note: The authors provide a theoretical analysis of how to approximate such models and do not claim to have "cracked" the problem

In fact, in this 51-page paper, researchers from Fudan University, among others, analyze a roadmap for implementing o1 from the perspective of reinforcement learning.

There are four key components to focus on: policy initialization, reward design, search, and learning.

In addition, as part of the roadmap, the researchers also summarized the existing "open-source o1" projects.

Address: https://arxiv.org/abs/2412.14135

Exploring OpenAI's "mystery of AGI"

In a nutshell, reasoning models like o1 can be thought of as a combination of an LLM and AlphaGo.

First, models need to be trained on "Internet data" so that they can understand text and reach a certain level of intelligence.

Then, add reinforcement learning methods to make them "think systematically."

Finally, in the process of finding an answer, the model "searches" the solution space. This search is used both at test time to produce the actual answer and at training time to improve the model, i.e., for "learning."

Notably, Stanford and Google's 2022 paper "STaR: Self-Taught Reasoner" suggested that the "reasoning process" generated by LLMs before answering questions could be used to fine-tune future models, thereby improving their ability to answer such questions.

STaR allows AI models to "guide" themselves to higher levels of intelligence by repeatedly generating their own training data, an approach that could, in theory, allow language models to surpass human-level intelligence.

Therefore, the idea of having the model "search the solution space deeply" plays a key role in both the training phase and the testing phase.
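
The STaR-style loop described above can be sketched in a few lines of Python. This is only an illustration: generate, check, and finetune are hypothetical stand-ins for an LLM sampling call, an answer checker, and a fine-tuning step, and are not part of the paper.

```python
# Minimal sketch of a STaR-style self-training loop (illustrative only).
# `generate`, `check`, and `finetune` are hypothetical callables standing in
# for LLM sampling, answer verification, and a fine-tuning step.

def star_round(generate, check, finetune, model, problems, n_samples=4):
    """Sample rationales, keep those whose final answer verifies as correct,
    then fine-tune the model on its own successful reasoning traces."""
    kept = []
    for problem in problems:
        for _ in range(n_samples):
            rationale, answer = generate(model, problem)
            if check(problem, answer):
                kept.append((problem, rationale, answer))
                break
    return finetune(model, kept)


def star(generate, check, finetune, model, problems, rounds=3):
    """Repeat generate-filter-finetune; each round bootstraps the next."""
    for _ in range(rounds):
        model = star_round(generate, check, finetune, model, problems)
    return model
```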

In this work, the researchers mainly analyzed the implementation of o1 from the following four aspects: policy initialization, reward design, search, and learning.

Policy initialization

Policy initialization enables the model to develop "human-like reasoning behavior" and thus the ability to efficiently explore the solution space of complex problems.

  • Massive text data pre-training
  • Instruction fine-tuning
  • Learning abilities such as problem analysis, task decomposition, and self-correction

Reward design

Reward design provides dense and effective signals through reward shaping or reward modeling to guide the model's learning and search processes.

  • Outcome reward (based on the final result)
  • Process reward (based on intermediate steps)

Outcome reward (left) and process reward (right)

Search

Search plays a crucial role in both training and testing: spending more compute can yield better solutions.

  • Tree search methods such as MCTS explore a variety of solutions
  • Sequential revision iteratively improves the answer
  • A combination of both approaches may be the best option

Types of guidance used in the search process: internal guidance, external guidance, and a combination of both

Learning

Learning from human expert data requires expensive data annotation. In contrast, reinforcement learning learns through interaction with the environment, avoids the high cost of data annotation, and has the potential to achieve performance beyond humans.

  • Policy optimization methods such as PPO and DPO
  • Behavior cloning of high-quality search solutions
  • Iterated cycles of search and learning

In summary, as the researchers suspected in November 2023, the next LLM breakthrough is likely to be some combination with Google DeepMind's Alpha series (such as AlphaGo).

The significance of this research goes beyond publishing a paper: it also opens the door for others to use RL to implement the same concept, to provide different types of reasoning feedback, and to develop playbooks and recipes that AI can use.

"Open Source o1"

The researchers conclude that although OpenAI has not yet published a technical report on o1, the academic community has already offered several open-source implementations of o1.

In addition, there are several o1-like models in industry, such as k0-math, Skywork-o1, DeepSeek-R1, QwQ, and InternThinker.

  • g1: This project is probably the first attempt to re-implement o1.
  • Thinking Claude: Similar to g1, but it prompts the LLM with more complex and fine-grained operations.
  • Open-o1: The project proposes an SFT dataset in which each response contains a CoT. The researchers speculate that the data may have come from human experts or a powerful LLM.
  • o1 Journey: Described in two technical reports. Part one traverses trees generated by beam search; specific nodes are refined by GPT-4 and used for SFT, a strategy that can be described as expert iteration. Part two attempts to distill o1-mini and recover the hidden CoT process through prompting.
  • Open-Reasoner: The framework is similar to AlphaGo, improving model performance through reinforcement learning.
  • Slow Thinking with LLMs: This research is also split across two technical reports. The first part is similar to Open-Reasoner, combining reinforcement learning with test-time search. The second part distills from QwQ and DeepSeek-R1 and tries two reinforcement learning methods.
  • Marco-o1: The project combines Open-o1 data with data the model itself generates via the MCTS algorithm for SFT training.
  • O1-coder: The project attempts to re-implement o1 in the area of code generation.

A comparison of approaches from different open source o1 projects in the areas of policy initialization, reward design, search, and learning

Policy initialization

In reinforcement learning, policies define how the agent chooses actions based on the state of the environment.

The action granularity of an LLM can be divided into three levels: solution level, step level, and token level.

The interaction between agent and environment in LLM reinforcement learning

The initialization process of LLM mainly includes two stages: pre-training and instruction fine-tuning.

In the pre-training stage, the model develops basic language understanding through self-supervised learning on a large-scale web corpus, following the established power law between compute and performance.

In the instruction fine-tuning phase, the LLM shifts from simply predicting the next token to generating responses aligned with human needs.

For models like o1, incorporating human-like reasoning behavior is essential for more complex solution space exploration.

Pre-training

Pre-training builds basic language understanding and reasoning skills for LLM through exposure to a large corpus of texts.

For o1-like models, these core competencies are the basis for developing advanced behaviors in subsequent learning and search.

  • Language understanding and production: Language understanding develops hierarchically - syntactic patterns emerge early, while logical coherence and abstract reasoning develop in later stages of training. Therefore, in addition to model size, training duration and data composition are also critical.
  • World knowledge acquisition and storage: Knowledge storage has efficient compression and generalization characteristics, while abstract concepts require more extensive training than factual knowledge.
  • Basic reasoning skills: Pre-training develops basic reasoning skills through exposure to a variety of reasoning patterns, which emerge hierarchically, from simple inference to complex reasoning.

Instruction fine-tuning

Instruction fine-tuning transforms pre-trained language models into task-oriented agents by specialized training on multi-domain instruction-response pairs.

This process changes the model's behavior from merely predicting the next token to acting with a clear purpose.

The effect depends mainly on two key factors: the diversity of the instruction data set and the quality of the instruction-response pairs.
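
As a concrete illustration of how such instruction-response pairs are typically prepared, the sketch below shows a common SFT convention in which the loss is computed only on the response tokens. The tokenizer object and the -100 ignore label are assumptions borrowed from popular training libraries, not details from the paper.

```python
def format_sft_example(instruction: str, response: str, tokenizer):
    """Prepare one instruction-response pair for supervised fine-tuning.

    `tokenizer` is a placeholder object with an `encode` method. Following a
    common convention, prompt tokens get the label -100 so that only the
    response tokens contribute to the cross-entropy loss."""
    prompt_ids = tokenizer.encode(f"Instruction: {instruction}\nResponse: ")
    response_ids = tokenizer.encode(response)
    input_ids = prompt_ids + response_ids
    labels = [-100] * len(prompt_ids) + response_ids
    return input_ids, labels
```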

Human-like reasoning behavior

While instruction-fine-tuned models demonstrate general-purpose task capabilities and user intent understanding, models like o1 require more complex human-like reasoning capabilities to reach their full potential.

As shown in Table 1, researchers analyzed o1's behavior patterns and identified six types of human reasoning behaviors.

  • Problem analysis: Problem analysis is a critical initialization process where the model reformulates and analyzes the problem before solving it.
  • Task decomposition: When faced with a complex problem, humans usually break it down into a number of manageable subtasks.
  • Task completion: The model then generates the solution through step-by-step reasoning, building on the analyzed problem and the decomposed subtasks.
  • Alternative proposals: The ability to generate diverse alternative solutions is especially important when reasoning hits an obstacle or the chain of thought breaks down. As shown in Table 1, o1 demonstrates this ability in cipher decoding, systematically proposing multiple options.
  • Self-assessment: After the task is completed, self-assessment serves as a key verification mechanism to confirm the correctness of the proposed solution.
  • Self-correction: When there are manageable errors in its reasoning, the model adopts self-correcting behaviors to resolve them. In o1's demonstrations, words such as "No" or "Wait" signal that the correction process has been triggered.

Speculation about o1 policy initialization

Policy initialization plays a key role in developing o1-like models, as it establishes the fundamental capabilities that shape subsequent learning and search.

The policy initialization stage consists of three core components: pre-training, instruction fine-tuning, and the development of human-like reasoning behavior.

Although these reasoning behaviors are already implicitly present in the LLM after instruction fine-tuning, deploying them effectively requires activation through supervised fine-tuning or carefully crafted prompts.

  • Long-text generation capability: During reasoning, the LLM needs fine-grained modeling of long contexts in order to generate long, coherent outputs.
  • Reasonable shaping of human-like reasoning behavior: The model also needs to learn to sequence human-like reasoning behaviors in a logically coherent way.
  • Self-reflection: Behaviors such as self-assessment, self-correction, and alternative proposal can be seen as manifestations of the model's ability to self-reflect.

Reward design

In reinforcement learning, the agent receives reward feedback signals from the environment and maximizes its long-term reward by improving its policy.

The reward function is usually written as r(s_t, a_t), the reward the agent obtains for performing action a_t in state s_t at time step t.

The reward feedback signal is crucial in training and reasoning because it specifies the desired behavior of the agent through numerical scoring.

Result reward and process reward

Outcome rewards assign a score based on whether the LLM's output meets predefined expectations. However, the lack of supervision over intermediate steps may lead the LLM to generate incorrect solution steps.

In contrast to the outcome reward, the process reward provides a reward signal not only for the final step but also for intermediate steps. Despite its great potential, a process reward is more challenging to learn than an outcome reward.
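
The difference in granularity can be made concrete with a toy sketch. The scoring callables below are placeholders, not the actual models behind o1.

```python
from typing import Callable, List

def outcome_reward(final_answer: str,
                   score_final: Callable[[str], float]) -> float:
    """ORM: a single scalar judged from the final answer only."""
    return score_final(final_answer)

def process_reward(solution_steps: List[str],
                   score_step: Callable[[str], float]) -> List[float]:
    """PRM: one score per intermediate step, giving denser supervision."""
    return [score_step(step) for step in solution_steps]
```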

Reward design method

Since outcome rewards can be viewed as a special case of process rewards, many reward design methods can be applied to modeling both outcome rewards and process rewards.

These models are often referred to as the Outcome Reward Model (ORM) and Process Reward Model (PRM).

  • Reward from the environment: The most direct approach to reward design is to use the reward signals provided by the environment directly, or to learn a model that simulates the environment's reward signals.
  • Modeling rewards from data: For some environments, reward signals are unavailable and cannot be simulated. In such cases it is easier to collect expert data or preference data than to provide rewards directly, and a model that supplies effective rewards can be learned from this data (a pairwise-loss sketch follows this list).
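
One standard recipe for learning a reward model from preference data is a Bradley-Terry-style pairwise loss. The NumPy sketch below shows only the loss term; treating this as the general approach is an assumption, not a description of o1's actual training.

```python
import numpy as np

def pairwise_preference_loss(r_chosen: np.ndarray, r_rejected: np.ndarray) -> float:
    """Bradley-Terry style loss: push the reward-model score of the preferred
    response above the score of the rejected one.

    r_chosen / r_rejected: scores for the preferred and rejected responses to
    the same prompts (shape: [batch])."""
    margin = r_chosen - r_rejected
    # -log(sigmoid(margin)), written in a numerically stable form.
    return float(np.mean(np.logaddexp(0.0, -margin)))
```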

Reward shaping

In some environments, reward signals may not effectively communicate learning goals.

In this case, rewards can be redesigned through reward shaping to make them richer and more informative.

However, since the value function depends on the policy π, a value function estimated under one policy may not be suitable as a reward function for another policy.
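
A classic instance of reward shaping is potential-based shaping, which adds a potential term without changing the optimal policy. The snippet below is a generic illustration of that idea, not a claim about o1's actual shaping scheme; the caveat above applies when the potential is a value function estimated under a different policy.

```python
def shaped_reward(r: float, phi_s: float, phi_s_next: float, gamma: float = 0.99) -> float:
    """Potential-based reward shaping: r'(s, a, s') = r + gamma * phi(s') - phi(s).

    `phi` is a potential function, often an estimated value function. Shaping of
    this form preserves the optimal policy, but a phi estimated under one policy
    can be misleading when reused as a reward for a different policy."""
    return r + gamma * phi_s_next - phi_s
```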

Speculation about o1 reward design

Given o1's ability to handle multitasking reasoning, its reward model may incorporate multiple reward design approaches.

For complex reasoning tasks such as math and code, because the answers to these tasks often involve long chains of reasoning, a process reward model (PRM) is more likely to be used to oversee intermediate processes than an outcome reward model (ORM).

When reward signals are not available in the environment, the researchers speculate that o1 may rely on learning from preference data or expert data.

According to OpenAI's five-stage plan for AGI, o1 is already a powerful reasoning model, and the next stage is to train an agent that can interact with the world and solve real-world problems.

To achieve this, a reward model is needed that provides reward signals for the agent's behavior in a real environment.

  • Reward Integration: An intuitive way to build reward signals for generic tasks is through domain-specific reward integration.
  • World model: The world model can not only provide reward signals, but also predict the next state. It has been suggested that a video generator can be used as a model of the world because it is able to predict images of future time steps.

Search

For models like o1 designed to solve complex inference tasks, search may play an important role in both training and inference processes.

Search guidance

Search based on internal guidance does not rely on real feedback from the external environment or a proxy model; rather, it guides the search process through the model's own states or its ability to evaluate itself.

External guidance typically does not depend on a specific policy; it relies only on environment- or task-related signals to guide the search process.

Internal and external guidance can also be combined to steer the search; a common approach is to combine the model's own uncertainty with proxy feedback from a reward model.
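
One way such a combination could look (an assumption for illustration, not o1's confirmed mechanism) is to rank candidate continuations by a weighted sum of the policy's own log-probability (internal guidance) and a reward-model score (external guidance):

```python
from typing import Callable, List, Tuple

def rank_by_combined_guidance(candidates: List[str],
                              policy_logprob: Callable[[str], float],      # internal guidance
                              reward_model_score: Callable[[str], float],  # external guidance
                              alpha: float = 0.5) -> List[Tuple[str, float]]:
    """Rank candidate continuations by a weighted mix of internal and external
    guidance. Both scoring callables are placeholders."""
    scored = [(c, alpha * policy_logprob(c) + (1 - alpha) * reward_model_score(c))
              for c in candidates]
    return sorted(scored, key=lambda item: item[1], reverse=True)
```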

Search strategy

Researchers classify search strategies into two types: tree search and sequential revision.

Tree search is a global search method that generates multiple answers simultaneously to explore a wider range of solutions.

In contrast, sequential revision is a local search method that progressively refines each attempt based on previous results, and may be more efficient.

Tree search is usually suited to solving complex problems, while sequential revision is better suited to fast iterative refinement.
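
The contrast can be sketched with two tiny routines; generate, revise, and score are hypothetical callables standing in for the policy and a verifier, not components the paper specifies.

```python
def best_of_n(generate, score, prompt, n: int = 8) -> str:
    """Global, parallel search: sample N candidate answers and keep the best-scoring one."""
    candidates = [generate(prompt) for _ in range(n)]
    return max(candidates, key=score)

def sequential_revision(generate, revise, score, prompt, steps: int = 4) -> str:
    """Local, iterative search: start from one attempt and keep revising it,
    accepting a revision only when its score improves."""
    answer = generate(prompt)
    for _ in range(steps):
        candidate = revise(prompt, answer)
        if score(candidate) > score(answer):
            answer = candidate
    return answer
```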

The role of search in o1

The researchers believe that search plays a crucial role in both o1's training and reasoning.

They refer to the two stages of search as training-time search and inference-time search, respectively.

In the training phase, the trial-and-error process in online reinforcement learning can also be viewed as a search process.

In the inference phase, o1 shows that model performance can be continuously improved by increasing the amount of inference computation and extending the thinking time.

The researchers believe that o1's "think more" approach can be viewed as a search, using more reasoning computing time to find a better answer.

Speculation about the o1 search

  • Training-phase search: During training, o1 is more likely to employ tree-search-style techniques, such as Best-of-N (BoN) sampling or tree search algorithms, and to rely primarily on external guidance.
  • Inference-phase search: During inference, o1 is more likely to use sequential revision combined with internal guidance, continuously refining and revising its search process through reflection.

As can be seen from the examples in the o1 blog, o1's reasoning style is closer to sequential revision. Everything suggests that o1 relies primarily on internal guidance during inference.

Learning

Reinforcement learning typically uses the policy to sample trajectories and improves the policy based on the rewards obtained.

In the context of o1, the researchers hypothesize that the reinforcement learning process generates a trajectory through a search algorithm, rather than relying solely on sampling.

Based on this assumption, reinforcement learning for o1 may involve an iterative process of search and learning.

In each iteration, the learning phase uses the output generated by the search as training data to enhance the strategy, and the improved strategy is then applied to the search process in the next iteration.

The search in the training phase is different from the search in the test phase.

The researchers denote the set of state-action pairs output by search as D_search, and the set of state-action pairs belonging to the optimal solutions within that search as D_expert. D_expert is therefore a subset of D_search.
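
Under this assumption, one iteration of the hypothesized search-and-learn loop might look like the sketch below. Here search, reward, and train are placeholders; D_expert is obtained by keeping only the best-rewarded trajectories from D_search.

```python
def search_and_learn_iteration(policy, problems, search, reward, train, top_k: int = 1):
    """One iteration of a hypothesized search-and-learn loop.

    search(policy, problem) -> list of trajectories (state-action sequences), i.e. D_search
    reward(trajectory)      -> scalar used to select the best solutions, i.e. D_expert
    train(policy, data)     -> improved policy (e.g. behavior cloning on D_expert)."""
    d_search, d_expert = [], []
    for problem in problems:
        trajectories = search(policy, problem)
        d_search.extend(trajectories)
        # D_expert keeps only the highest-reward trajectories, a subset of D_search.
        d_expert.extend(sorted(trajectories, key=reward, reverse=True)[:top_k])
    # Behavior cloning uses d_expert; a policy-gradient method could use all of d_search.
    return train(policy, d_expert)
```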

Learning method

Given D_search, the policy can be improved by policy gradient methods or behavior cloning.

Proximal Policy Optimization (PPO) and Direct Preference Optimization (DPO) are the most commonly used reinforcement learning techniques for LLMs. In addition, it is common practice to perform behavior cloning or supervised learning on the search data.
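
For reference, the published DPO objective can be written compactly as below. This is the standard DPO loss shown as an illustration of the kind of objective that could be applied to search-derived preference pairs; its use in o1 is speculation.

```python
import numpy as np

def dpo_loss(logp_chosen: np.ndarray, logp_rejected: np.ndarray,
             ref_logp_chosen: np.ndarray, ref_logp_rejected: np.ndarray,
             beta: float = 0.1) -> float:
    """Direct Preference Optimization loss on a batch of preference pairs.

    Each argument holds sequence log-probabilities under the current policy
    (logp_*) or the frozen reference policy (ref_logp_*)."""
    margin = beta * ((logp_chosen - ref_logp_chosen) - (logp_rejected - ref_logp_rejected))
    # -log(sigmoid(margin)), written in a numerically stable form.
    return float(np.mean(np.logaddexp(0.0, -margin)))
```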

The researchers believe that o1's learning may result from a combination of multiple learning methods.

In this framework, they hypothesize that o1's learning process begins with a warm-up phase of behavior cloning and switches to PPO or DPO once the gains from behavior cloning plateau.

This process is consistent with the post-training strategies used for Llama 2 and Llama 3.

Scaling Law of reinforcement learning

In the pre-training phase, the relationship between loss, compute, model parameters, and data size follows a power-law scaling law. Does reinforcement learning exhibit a similar scaling law?

According to OpenAI's blog, o1's reasoning performance does show a log-linear relationship with train-time compute. Beyond that, however, there is not much research.
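
As a small illustration of what studying such a scaling law involves, a power law of the form loss ≈ a · C^(-b) can be fitted by linear regression in log-log space; the measurements would have to come from actual training runs, none of which are reported here.

```python
import numpy as np

def fit_power_law(compute: np.ndarray, loss: np.ndarray):
    """Fit loss ≈ a * compute**(-b) by linear regression in log-log space.

    `compute` and `loss` are positive measurements from training runs.
    Returns the fitted coefficients (a, b)."""
    slope, intercept = np.polyfit(np.log(compute), np.log(loss), 1)
    return float(np.exp(intercept)), float(-slope)
```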

In order to realize large-scale reinforcement learning like o1, it is crucial to study the Scaling Law of LLM reinforcement learning.
