VentureBeat
Nvidia researchers boost LLMs' reasoning skills by getting them to 'think' during pre-training
Researchers at Nvidia have developed a new technique called reinforcement learning pre-training (RLP), which integrates reinforcement learning into the initial training phase of large language models (LLMs). The approach encourages the model to think for itself before predicting what comes next, teaching it to reason on plain text without external verifiers.

The typical LLM training cycle starts with pre-training on vast amounts of text using a next-token prediction objective, followed by a post-training phase in which the model learns complex reasoning abilities. However, this sequential process does not match human comprehension, which integrates input with prior knowledge in parallel.

RLP reframes pre-training by treating chain-of-thought generation as an action the model takes before predicting the next token. The model receives a reward based on how much its thought improved the accuracy of its prediction, which eliminates the need for external verifiers or human-labeled data.

Models trained with RLP consistently outperform their conventionally trained counterparts on complex reasoning tasks. The benefits compound rather than disappear during subsequent fine-tuning stages, and the technique has proved both scalable and versatile.

The researchers believe RLP points toward a future in which pre-training is no longer a monolithic process of next-token prediction but a hybrid of objectives that produces AI that learns to think more robustly from day one. RLP has the potential to revolutionize the way large language models are trained, enabling them to develop deeper, more structured thinking much earlier in training.
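The reward signal described earlier can be illustrated with a small sketch. It assumes the reward is the gain in log-probability that the model assigns to the true next token when it conditions on its own thought, compared with predicting without one. The article says only that the reward reflects how much the thought improved the prediction, so the exact formula, the function names, and the toy probabilities below are illustrative assumptions, not Nvidia's implementation.

```python
import math
from typing import Mapping


def log_prob(probs: Mapping[str, float], token: str) -> float:
    """Log-probability assigned to `token`; a small floor avoids log(0)."""
    return math.log(max(probs.get(token, 0.0), 1e-12))


def thought_reward(
    probs_with_thought: Mapping[str, float],
    probs_without_thought: Mapping[str, float],
    actual_next_token: str,
) -> float:
    """Hypothetical reward: how much the thought raised the log-probability
    of the token that actually comes next in the training text."""
    return log_prob(probs_with_thought, actual_next_token) - log_prob(
        probs_without_thought, actual_next_token
    )


# Toy next-token distributions (made-up numbers, for illustration only).
without_thought = {"sunlight": 0.2, "water": 0.5, "soil": 0.3}
helpful_thought = {"sunlight": 0.6, "water": 0.3, "soil": 0.1}
unhelpful_thought = {"sunlight": 0.1, "water": 0.6, "soil": 0.3}

# The training text's real next token serves as the only "label".
helpful_reward = thought_reward(helpful_thought, without_thought, "sunlight")
unhelpful_reward = thought_reward(unhelpful_thought, without_thought, "sunlight")

print(f"helpful thought reward:   {helpful_reward:+.3f}")
print(f"unhelpful thought reward: {unhelpful_reward:+.3f}")
```

In this sketch, a thought that makes the true next token more likely earns a positive reward and one that makes it less likely earns a negative reward. The signal comes entirely from the text being read, which is why no external verifier or human label is required.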