Enhancing Reasoning in LLMs with Reinforcement Learning and Search Integration

AI developer Philipp Schmid from Google DeepMind has introduced an approach to enhancing the reasoning capabilities of large language models (LLMs) through reinforcement learning (RL). In the paper, titled "ReSearch," Schmid explores how LLMs can be trained to weave search operations into their reasoning process, improving multi-hop question answering without relying on supervised data for intermediate reasoning steps.

The method defines a special tag format (<think>, <search>, <result>) and uses an instruction-tuned LLM such as Qwen2.5 as the policy model. A reinforcement learning environment built on the Group Relative Policy Optimization (GRPO) algorithm integrates an external search tool: when the model emits a <search> query, the tool is invoked and its output is fed back into the context as a <result> block (see the rollout sketch below). Prompts guide the LLM to adhere to this reasoning format, and a rule-based reward function scores both the accuracy of the final answer and its compliance with the format.

During training, retrieved search results are masked out of the loss calculation, so the model is credited for deciding when and what to search rather than for the retrieved text itself. Preliminary results show that GRPO-optimized models incorporate search effectively into complex reasoning tasks, outperforming traditional retrieval-augmented generation (RAG) methods and generalizing strongly on multi-hop benchmarks such as MuSiQue. The research underscores broader applications, pointing to a future where LLMs skillfully integrate tool usage within their reasoning frameworks.
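The interaction loop implied by the tag format can be pictured as follows. This is a minimal sketch in Python, not the paper's released code: `llm_generate`, `search_tool`, and the system prompt wording are assumed placeholder interfaces.

```python
import re

# Assumed system prompt in the paper's tag format (wording is illustrative).
SYSTEM_PROMPT = (
    "Reason step by step inside <think>...</think>. "
    "When you need external knowledge, emit a query inside <search>...</search>; "
    "results will appear inside <result>...</result>. "
    "Give the final answer inside <answer>...</answer>."
)

SEARCH_RE = re.compile(r"<search>(.*?)</search>", re.DOTALL)

def rollout(llm_generate, search_tool, question, max_searches=4):
    """One sampled trajectory. `llm_generate(prompt, stop)` is assumed to
    return generated text up to and including the stop string it hit;
    `search_tool` maps a query string to retrieved passages. Both are
    placeholders for whatever inference and retrieval stack is in use."""
    transcript = f"{SYSTEM_PROMPT}\n\nQuestion: {question}\n"
    for _ in range(max_searches):
        chunk = llm_generate(transcript, stop=["</search>", "</answer>"])
        transcript += chunk
        match = SEARCH_RE.search(chunk)
        if match is None:
            break  # the model answered without (further) searching
        # Call the external search tool and inject its output as <result>.
        docs = search_tool(match.group(1).strip())
        transcript += f"\n<result>{docs}</result>\n"
    return transcript
```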
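The rule-based reward can likewise be sketched directly from the description: no learned reward model, just a format check plus a string match against the gold answer. The 0.1/0.9 weighting and the normalization rules are assumptions, not values from the paper.

```python
import re

ANSWER_RE = re.compile(r"<think>.*?</think>.*?<answer>(.*?)</answer>", re.DOTALL)

def normalize(text):
    """Lowercase, drop articles and punctuation, collapse whitespace."""
    text = re.sub(r"\b(a|an|the)\b", " ", text.lower())
    text = re.sub(r"[^\w\s]", " ", text)
    return " ".join(text.split())

def rule_based_reward(completion, gold_answer):
    """Reward = small format bonus + correctness bonus. The completion must
    contain well-formed <think> and <answer> tags to earn anything; the
    0.1 / 0.9 split is an assumed weighting."""
    match = ANSWER_RE.search(completion)
    if match is None:
        return 0.0  # format violation: no reward at all
    correct = normalize(match.group(1)) == normalize(gold_answer)
    return 0.1 + (0.9 if correct else 0.0)
```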
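GRPO itself requires no value network: it samples a group of completions per prompt and normalizes each reward within that group. A minimal version of the advantage computation, following the standard GRPO formulation rather than code from the paper:

```python
import torch

def grpo_advantages(group_rewards, eps=1e-6):
    """Group Relative Policy Optimization estimates advantages by
    normalizing each sampled completion's reward against its group:

        A_i = (r_i - mean(r)) / (std(r) + eps)

    This is the standard formulation; the paper's exact variant may differ."""
    r = torch.as_tensor(group_rewards, dtype=torch.float32)
    return (r - r.mean()) / (r.std() + eps)

# Example: four rollouts of one question, scored by rule_based_reward above.
# grpo_advantages([1.0, 0.1, 0.1, 1.0])  # -> tensor([0.87, -0.87, -0.87, 0.87])
```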
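Finally, the loss-masking detail: tokens injected from <result> blocks receive zero weight in the per-token loss, so gradients flow only through text the policy itself generated. The span bookkeeping below is an assumed implementation detail, recorded at the moment results are injected during the rollout.

```python
import torch

def result_token_mask(seq_len, result_spans):
    """Boolean mask over a rollout's tokens that is False inside every
    injected <result>...</result> span. Applied to the per-token RL loss,
    it stops gradients from flowing through retrieved text, so the policy
    is trained on its search decisions, not on copying results."""
    mask = torch.ones(seq_len, dtype=torch.bool)
    for start, end in result_spans:
        mask[start:end] = False  # retrieved tokens get no gradient
    return mask

# Inside a GRPO-style update (schematic):
# per_token_loss = -(advantage * token_logprobs)        # shape [seq_len]
# mask = result_token_mask(len(token_logprobs), result_spans)
# loss = (per_token_loss * mask).sum() / mask.sum()
```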